In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework for
learning with ambiguous objects, where an example is described by multiple instances and
associated with multiple class labels. Compared with traditional learning frameworks,
the MIML framework is more convenient and natural for representing ambiguous objects.
To learn MIML examples, we propose the MimlBoost and MimlSvm algorithms based
on a simple degeneration strategy, and experiments show that solving problems involving
ambiguous objects in the MIML framework can lead to good performance. Considering
that the degeneration process may lose information, we propose the D-MimlSvm algorithm,
which tackles MIML problems directly in a regularization framework. Moreover, we
show that even when we do not have access to the raw objects and thus cannot capture
more information from raw objects by using the MIML representation, MIML is still
useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming
single-instances into the MIML representation for learning, while SubCod works by
transforming single-label examples into the MIML representation for learning. Experiments
show that in some tasks they are able to achieve better performance than learning
the single-instances or single-label examples directly.

1 Introduction



In traditional supervised learning, an object is represented by an instance, i.e., a 
feature vector, and associated with a class label. Formally, let $\mathcal{X}$ denote the instance
space (or feature space) and $\mathcal{Y}$ the set of class labels. The task is to learn a function
$f : \mathcal{X} \to \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i \in \mathcal{X}$
is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$. Although this formalization
is prevailing and successful, there are many real-world problems which do not fit 
in this framework well. In particular, each object in this framework belongs to 
only one concept and therefore, the corresponding instance is associated with a 
single class label. However, many real-world objects are ambiguous, which may 
belong to multiple concepts simultaneously. For example, an image can belong to 
several classes simultaneously, e.g., grasslands, lions, Africa, etc.; a text document 
can be classified to several categories if it is viewed from different aspects, e.g., 
scientific novel, Jules Verne's writing or even books on traveling; a web page can be 
recognized as news page, sports page, soccer page, etc. In a specific real task, maybe 
only one of the multiple concepts is the right semantic meaning. For example, in 
image retrieval when a user is interested in an image with lions, s/he may be 
only interested in the concept lions instead of the other concepts grasslands and 
Africa associated with that image. The difficulty arises because those
objects involve multiple concepts; we call such objects ambiguous objects.
Choosing the right semantic meaning for such objects in a specific scenario is the
fundamental difficulty of many tasks. In contrast to starting from a large universe
of all possible concepts involved in the task, it may be helpful to get the subset of 
concepts associated with the concerned object at first, and then make a choice in 
the small subset later. However, getting the subset of concepts, that is, assigning 
proper class labels to such objects, is still a challenging task. 

We notice that as an alternative to representing an object by a single instance, in 
many cases it is possible to represent an ambiguous object using a set of instances. 
For example, multiple patches can be extracted from an image where each patch 
is described by an instance, and thus the image can be represented by a set of 
instances; multiple sections can be extracted from a document where each section 
is described by an instance, and thus the document can be represented by a set 
of instances; multiple links can be extracted from a web page where each link is
described by an instance, and thus the web page can be represented by a set of
instances. Using multiple instances to represent those ambiguous objects may be
helpful because some inherent patterns which are closely related to some labels may
become explicit and clearer. In this paper, we propose the MIML (Multi-Instance
Multi-Label learning) framework for learning with ambiguous objects, where an
example is described by multiple instances and associated with multiple class labels.
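
As a rough illustration of representing one object by a set of instances, the sketch below cuts a small grayscale image into patches and describes each patch by a toy feature vector. The function name `image_to_bag`, the patch size and the (mean, min, max) features are assumptions made for this example only; they are not the feature extraction used in the experiments.

```python
from typing import List, Sequence

Instance = List[float]   # one feature vector
Bag = List[Instance]     # one object, described by a set of instances


def image_to_bag(image: Sequence[Sequence[float]], patch: int) -> Bag:
    """Cut a 2-D grayscale image into non-overlapping patch x patch blocks
    and describe each block by a toy feature vector (mean, min, max)."""
    rows, cols = len(image), len(image[0])
    bag: Bag = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            pixels = [image[r + dr][c + dc]
                      for dr in range(patch) for dc in range(patch)]
            bag.append([sum(pixels) / len(pixels), min(pixels), max(pixels)])
    return bag


# A hypothetical 4x4 image yields a bag of four instances.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [1, 2, 3, 4],
         [5, 6, 7, 8]]
bag = image_to_bag(image, patch=2)
```

Each element of `bag` plays the role of one instance; the image as a whole is the example.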

Compared with traditional learning frameworks, the MIML framework is more
convenient and natural for representing ambiguous objects. To exploit the
advantages of the MIML representation, new learning algorithms are needed. We
propose the MimlBoost algorithm and the MimlSvm algorithm based on a simple
degeneration strategy, and experiments show that solving problems involving
ambiguous objects under the MIML framework can lead to good performance.
Considering that the degeneration process may lose information, we also propose
the D-MimlSvm (i.e., Direct MimlSvm) algorithm, which tackles MIML problems
directly in a regularization framework. Experiments show that this "direct"
algorithm outperforms the "indirect" MimlSvm algorithm.

In some practical tasks we do not have access to the raw objects (i.e., the real
objects themselves such as the real images and the real web pages; a raw object
may be ambiguous or unambiguous, depending on whether it
has multiple class labels or not; traditionally, a raw object is represented by an
instance after feature extraction), and instead, we are given observational data
where each ambiguous object has already been represented by a single instance.
Thus, in such cases we cannot capture more information from the raw objects using
the MIML representation. Even in this situation, however, MIML is still useful. We
propose the InsDif (i.e., INStance DIFferentiation) algorithm, which transforms
single-instances into MIML examples to learn. This algorithm is able to achieve
better performance than learning the single-instances directly in some tasks. This
is not surprising because for an object associated with multiple class labels, if it is
described by only a single instance, the information corresponding to these labels
is mixed and thus difficult to learn; if we can transform the single instance into
a set of instances in some proper way, the mixed information might be separated
to some extent and thus become less difficult to learn.



MIML can also be helpful for learning single-label objects. We propose the SubCod
(i.e., SUB-COncept Discovery) algorithm, which works by first discovering sub-concepts
of the target concept and then transforming the data into MIML examples
to learn. This algorithm is able to achieve better performance than learning the
single-label examples directly in some tasks. This is also not surprising because for a
label corresponding to a high-level complicated concept, it may be quite difficult
to learn this concept directly since many different lower-level concepts are mixed;
if we can transform the single label into a set of labels corresponding to some
sub-concepts, which are relatively clearer and easier to learn, we can learn these labels
first and then derive the high-level complicated label from them with less
difficulty.

The rest of this paper is organized as follows. In Section 2, we review some related 
works. In Section 3, we propose the MIML framework. Then, in Section 4 we 
propose the MimlBoost and MimlSvm algorithms, and apply them to tasks 
where the objects are represented as MIML examples. In Section 5 we present the 
D-MimlSvm algorithm and compare it with the "indirect" MimlSvm algorithm. 
In Sections 6 and 7, we study the usefulness of MIML when we do not have access 
to raw objects. Concretely, in Section 6, we propose the InsDif algorithm and 
show that using MIML can be better than learning single-instances directly; in 
Section 7 we propose the SubCod algorithm and show that using MIML can be 
better than learning single-label examples directly. Finally, we conclude the paper 
in Section 8. 



2 Related Work 

Much work has been devoted to the learning of multi-label examples under the
umbrella of multi-label learning. Note that multi-label learning studies the problem
where a real-world object described by one instance is associated with a number of
class labels^1, which is different from multi-class learning or multi-task learning [28].
In multi-class learning each object is only associated with a single label, while in
multi-task learning different tasks may involve different domains and different data
sets. Actually, traditional two-class and multi-class problems can both be cast into
multi-label problems by restricting each instance to have only one label. The
generality of multi-label problems, however, inevitably makes them more difficult to
address.

^1 Most work on multi-label learning assumes that an instance can be associated with
multiple valid labels, but there is also some work assuming that only one of the labels
among those associated with an instance is correct [34].

One famous approach to solving multi-label problems is Schapire and Singer's 
AdaBoost.MH [55], which is an extension of AdaBoost and is the core of a 
successful multi-label learning system BoosTexter [55]. This approach maintains 
a set of weights over both training examples and their labels in the training phase, 
where training examples and their corresponding labels that are hard (easy) to 
predict get incrementally higher (lower) weights. Later, De Comite et al. [22] used 
alternating decision trees [29] which are more powerful than decision stumps used in 
BoosTexter to handle multi-label data and thus obtained the AdtBoost.MH 
algorithm. Probabilistic generative models have been found useful in multi-label 
learning. McCallum [46] proposed a Bayesian approach for multi-label document 
classification, where a mixture probabilistic model (one mixture component per 
category) is assumed to generate each document and the EM algorithm is employed to
learn the mixture weights and the word distributions in each mixture component.
Ueda and Saito [65] presented another generative approach, which assumes that the
multi-label text has a mixture of characteristic words appearing in single-label text
belonging to each of the multi-labels. It is noteworthy that the generative models
used in [46] and [65] are both based on learning text frequencies in documents, and
are thus specific to text applications.
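
The idea of keeping weights over (example, label) pairs, as described above for AdaBoost.MH, can be sketched as follows. This is only a minimal illustration of the commonly stated exponential reweighting rule; the weak hypothesis outputs `pred` and the step size `alpha` are assumed to be given, and the function name is hypothetical.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[int, str]  # (example index, label)


def reweight_pairs(weights: Dict[Pair, float],
                   truth: Dict[Pair, int],
                   pred: Dict[Pair, int],
                   alpha: float) -> Dict[Pair, float]:
    """Raise the weight of (example, label) pairs that were predicted wrongly
    and lower the weight of pairs predicted correctly, then normalise.
    truth[p] and pred[p] are both in {-1, +1}."""
    new = {p: w * math.exp(-alpha * truth[p] * pred[p])
           for p, w in weights.items()}
    z = sum(new.values())
    return {p: w / z for p, w in new.items()}


# Two examples, two labels, uniform initial weights; one pair is mispredicted.
pairs = [(0, "a"), (0, "b"), (1, "a"), (1, "b")]
weights = {p: 0.25 for p in pairs}
truth = {(0, "a"): 1, (0, "b"): -1, (1, "a"): -1, (1, "b"): 1}
pred = dict(truth)
pred[(1, "b")] = -1          # the only mistake
new_weights = reweight_pairs(weights, truth, pred, alpha=0.5)
```

After the update, the mispredicted pair `(1, "b")` carries more weight than any correctly predicted pair.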

Many other multi-label learning algorithms have been developed, based on decision
trees, neural networks, k-nearest neighbor classifiers, support vector machines, etc.
Clare and King [21] developed a multi-label version of the C4.5 decision tree by
modifying the definition of entropy. Zhang and Zhou [79] presented the multi-label
neural network Bp-Mll, which is derived from the Backpropagation algorithm by
employing an error function to capture the fact that the labels belonging to an
instance should be ranked higher than those not belonging to that instance. Zhang
and Zhou [80] also proposed the Ml-kNN algorithm, which identifies the k nearest
neighbors of the concerned instance and then assigns labels according to the
maximum a posteriori principle. Elisseeff and Weston [27] proposed the RankSvm
algorithm for multi-label learning by defining a specific cost function and the
corresponding margin for multi-label models. Other kinds of multi-label Svms have
been developed by Boutell et al. [11] and Godbole and Sarawagi [32]. In particular,
by hierarchically approximating the Bayes optimal classifier for the H-loss,
Cesa-Bianchi et al. [15] proposed an algorithm which outperforms simple hierarchical
Svms. Recently, non-negative matrix factorization has also been applied to
multi-label learning [42], and multi-label dimensionality reduction methods have
been developed [74,85].

Roughly speaking, earlier approaches to multi-label learning attempt to divide
multi-label learning into a number of two-class classification problems [35,72] or
transform it into a label ranking problem [27,55], while some later approaches try
to exploit the correlation between the labels [42,65,85].
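
The first strategy, dividing a multi-label problem into independent two-class problems, can be sketched in a few lines. The function name and the toy label sets are illustrative assumptions; any two-class learner could then be trained on each target vector.

```python
from typing import Dict, Iterable, List, Set


def binary_relevance_targets(label_sets: List[Set[str]],
                             label_space: Iterable[str]) -> Dict[str, List[int]]:
    """For every label, build a +1/-1 target vector over the training
    examples: +1 if the example carries the label, -1 otherwise."""
    return {lab: [1 if lab in ys else -1 for ys in label_sets]
            for lab in label_space}


# Two hypothetical training examples and a three-label space.
label_sets = [{"lions", "Africa"}, {"grasslands"}]
targets = binary_relevance_targets(label_sets, ["Africa", "grasslands", "lions"])
```

Because each target vector is built separately, any correlation between labels is invisible to the individual two-class learners.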

The majority of studies on multi-label learning focus on text categorization [22,32,38,46,
55,65,74], and several studies aim to improve the performance of text categorization
systems by exploiting additional information given by the hierarchical structure 
of classes [14,15,52] or unlabeled data [42]. In addition to text categorization, 
multi-label learning has also been found useful in many other tasks such as scene 
classification [11], image and video annotation [37,47], bioinformatics [7,12,13,21, 
27], and even association rule mining [49,63]. 

There is a lot of research on multi-instance learning, which studies the problem
where a real-world object described by a number of instances is associated with a
single class label. Here the training set is composed of many bags each containing
multiple instances; a bag is labeled positively if it contains at least one positive
instance and negatively otherwise. The goal is to label unseen bags correctly. Note
that although the training bags are labeled, the labels of their instances are
unknown. This learning framework was formalized by Dietterich et al. [24] when they
were investigating drug activity prediction.
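
The standard bag-labeling rule stated above can be written down directly. In this small sketch the instance labels are assumed to be known purely for illustration; in real multi-instance data they are hidden and only the bag label is observed.

```python
from typing import Iterable


def bag_label(instance_labels: Iterable[int]) -> int:
    """Standard multi-instance rule: a bag is positive (+1) if it contains
    at least one positive instance, and negative (-1) otherwise."""
    return 1 if any(label == 1 for label in instance_labels) else -1


positive_bag = bag_label([-1, -1, 1])   # one positive instance is enough
negative_bag = bag_label([-1, -1, -1])  # no positive instance at all
```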

Long and Tan [43] studied the PAC-learnability of multi-instance learning and
showed that if the instances in the bags are independently drawn from a product
distribution, the Apr (Axis-Parallel Rectangle) proposed by Dietterich et al. [24]
is PAC-learnable. Auer et al. [5] showed that if the instances in the bags are not
independent then Apr learning under the multi-instance learning framework is
NP-hard. Moreover, they presented a theoretical algorithm that does not require
a product distribution, which was transformed into a practical algorithm named
MULTINST [4]. Blum and Kalai [10] described a reduction from PAC-learning under
the multi-instance learning framework to PAC-learning with one-sided random
classification noise. They also presented an algorithm with smaller sample
complexity than that of the algorithm of Auer et al. [5].

Many multi-instance learning algorithms have been developed during the past
decade. To name a few: Diverse Density [44] and Em-dd [83], k-nearest neighbor
algorithms Citation-kNN and Bayesian-kNN [67], decision tree algorithms
Relic [53] and Miti [9], neural network algorithms Bp-mip and extensions [77,90]
and Rbf-mip [78], rule learning algorithm Ripper-mi [20], support vector machines
and kernel methods Mi-Svm and mi-Svm [3], Dd-Svm [18], MissSvm [88],
Mi-Kernel [31], Bag-Instance Kernel [19], Marginalized Mi-Kernel [41]
and convex-hull method Ch-Fd [30], ensemble algorithms Mi-Ensemble [91],
MiBoosting [70] and MilBoosting [6], logistic regression algorithm Mi-lr [50],
etc. Actually almost all popular machine learning algorithms have got their
multi-instance versions. Most algorithms attempt to adapt single-instance supervised
learning algorithms to the multi-instance representation, by shifting their
focus from discrimination on the instances to discrimination on the
bags [91]. Recently there has been a proposal to adapt the multi-instance
representation to single-instance algorithms by representation transformation [93].

It is worth mentioning that the standard multi-instance learning [24] assumes that 
if a bag contains a positive instance then the bag is positive; this implies that there 
exists a key instance in a positive bag. Many algorithms were designed based on 
this assumption. For example, the point with maximal diverse density identified 
by the Diverse Density algorithm [44] actually corresponds to a key instance; 
many SvM algorithms defined the margin of a positive bag by the margin of its 
most positive instance [3, 19]. As the research of multi-instance learning goes on, 
however, some other assumptions have been introduced. For example, in contrast 
to assuming that there is a key instance, some work assumed that there is no key 
instance and every instance contributes to the bag label [17,70]. There is also an 
argument that the instances in the bags should not be treated independently [88]. 
All those assumptions have been put under the umbrella of multi-instance learning,
and generally, in tackling real tasks it is difficult to know which assumption is the
fittest. In other words, in different tasks multi-instance learning algorithms based
on different assumptions may each have their own strengths.

In the early years of the research of multi-instance learning, most work was on
multi-instance classification with discrete-valued outputs. Later, multi-instance
regression with real-valued outputs was studied [2,51], and different versions of
generalized multi-instance learning have been defined [57,68]. The main difference
between standard multi-instance learning and generalized multi-instance learning
is that in standard multi-instance learning there is a single concept, and a bag is
positive if it has an instance satisfying this concept; while in generalized
multi-instance learning [57,68] there are multiple concepts, and a bag is positive only
when all concepts are satisfied (i.e., the bag contains instances from every concept).
Recently, research on multi-instance clustering [82], multi-instance semi-supervised
learning [48] and multi-instance active learning [59] has also been reported.
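
The difference between the standard and the generalized bag-labeling rules described above can be made concrete with a small sketch. Concepts are modeled here as Boolean predicates on instances, which is an assumption made purely for illustration, as are the numeric toy instances.

```python
from typing import Callable, Iterable, List

Concept = Callable[[float], bool]


def standard_positive(bag: List[float], concept: Concept) -> bool:
    """Standard rule: positive if some instance satisfies the single concept."""
    return any(concept(x) for x in bag)


def generalized_positive(bag: List[float], concepts: Iterable[Concept]) -> bool:
    """Generalized rule: positive only if every concept is satisfied by
    at least one instance in the bag."""
    return all(any(c(x) for x in bag) for c in concepts)


def is_small(x: float) -> bool:
    return x < 3


def is_large(x: float) -> bool:
    return x > 10


bag_a = [1, 5, 12]   # contains a small and a large instance
bag_b = [5, 12]      # contains a large instance only
```

Here `bag_b` is positive under the standard rule for the concept `is_large`, but negative under the generalized rule with both concepts, since no instance satisfies `is_small`.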

Multi-instance learning has also attracted the attention of the ILP community. It
has been suggested that multi-instance problems could be regarded as a bias on
inductive logic programming, and the multi-instance paradigm could be the key between
the propositional and relational representations, being more expressive than
the former, and much easier to learn than the latter [23]. Alphonse and Matwin [1]
approximated a relational learning problem by a multi-instance problem, fed the
resulting data to feature selection techniques adapted from propositional
representations, and then transformed the filtered data back to relational representation for
a relational learner. Thus, the expressive power of relational representation and the
ease of feature selection on propositional representation are gracefully combined.
This work confirms that multi-instance learning can really act as a bridge between
propositional and relational learning.

Multi-instance learning techniques have already been applied to diverse applications
including image categorization [17,18], image retrieval [71,84], text categorization
[3,59], web mining [86], spam detection [36], computer security [53], face
detection [66,76], computer-aided medical diagnosis [30], etc.



3 The MIML Framework 



Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}$ the set of class labels. Then, formally, the
MIML task is defined as:

• MIML (multi-instance multi-label learning): To learn a function $f : 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$
from a given data set $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set
of instances $\{x_{i1}, x_{i2}, \ldots, x_{i,n_i}\}$, $x_{ij} \in \mathcal{X}$ $(j = 1, 2, \ldots, n_i)$, and $Y_i \subseteq \mathcal{Y}$ is a
set of labels $\{y_{i1}, y_{i2}, \ldots, y_{i,l_i}\}$, $y_{ik} \in \mathcal{Y}$ $(k = 1, 2, \ldots, l_i)$. Here $n_i$ denotes the
number of instances in $X_i$ and $l_i$ the number of labels in $Y_i$.
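
One way to hold such a data set in memory is sketched below. The class name `MIMLExample` and its fields are hypothetical choices made for this illustration, not something prescribed by the framework; the point is only that every example pairs a set of instances with a set of labels.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Instance = Tuple[float, ...]


@dataclass(frozen=True)
class MIMLExample:
    """One MIML example (X_i, Y_i): a set of instances and a set of labels."""
    instances: Tuple[Instance, ...]   # X_i, with n_i instances
    labels: FrozenSet[str]            # Y_i, with l_i labels

    @property
    def n_instances(self) -> int:     # n_i in the definition
        return len(self.instances)

    @property
    def n_labels(self) -> int:        # l_i in the definition
        return len(self.labels)


# A toy image described by three patch vectors and carrying two labels.
example = MIMLExample(
    instances=((0.9, 0.1), (0.8, 0.2), (0.1, 0.7)),
    labels=frozenset({"lions", "grasslands"}),
)
```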

Note that a basic assumption of MIML is that the labels (or some of the labels) have 
some correlation. This is a common phenomenon for objects with multiple labels. 
For example, for an image with labels grasslands, lions and Africa, it is evident 
that these labels have correlation. Moreover, the way in which the instances trigger 
different labels may be different. Some label may be triggered by a key instance, 
some may be triggered by all instances; more generally, labels can be triggered by 
a subset of instances. Just as the different assumptions on the relationship between 
instance-labels and bag-labels in multi-instance learning, in tackling different real 
tasks it is difficult to know which assumption is the fittest one, and algorithms can 
be designed based on different assumptions. 

It is interesting to compare MIML with the existing frameworks of traditional 
supervised learning, multi-instance learning, and multi-label learning. 

• Traditional supervised learning (single-instance single-label learning): To
learn a function $f : \mathcal{X} \to \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$,
where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$.

• Multi-instance learning (multi-instance single-label learning): To learn a function
$f : 2^{\mathcal{X}} \to \mathcal{Y}$ from a given data set $\{(X_1, y_1), (X_2, y_2), \ldots, (X_m, y_m)\}$, where
$X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \ldots, x_{i,n_i}\}$, $x_{ij} \in \mathcal{X}$ $(j = 1, 2, \ldots, n_i)$, and
$y_i \in \mathcal{Y}$ is the label of $X_i$.^2 Here $n_i$ denotes the number of instances in $X_i$.

• Multi-label learning (single-instance multi-label learning): To learn a function
$f : \mathcal{X} \to 2^{\mathcal{Y}}$ from a given data set $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_m, Y_m)\}$, where
$x_i \in \mathcal{X}$ is an instance and $Y_i \subseteq \mathcal{Y}$ is a set of labels $\{y_{i1}, y_{i2}, \ldots, y_{i,l_i}\}$, $y_{ik} \in \mathcal{Y}$
$(k = 1, 2, \ldots, l_i)$. Here $l_i$ denotes the number of labels in $Y_i$.

^2 According to notions used in multi-instance learning, $(X_i, y_i)$ is a labeled bag while $X_i$
is an unlabeled bag.

Fig. 1. Four different learning frameworks

From Fig. 1 we can see the differences among these learning frameworks. In fact,
the multi-learning frameworks result from the ambiguities in representing
real-world objects. Multi-instance learning studies the ambiguity in the input space
(or instance space), where an object has many alternative input descriptions, i.e.,
instances; multi-label learning studies the ambiguity in the output space (or label
space), where an object has many alternative output descriptions, i.e., labels; while
MIML considers the ambiguities in both the input and output spaces simultaneously.
In solving real-world problems, having a good representation is often more
important than having a strong learning algorithm, because a good representation
may capture more meaningful information and make the learning task easier
to tackle. Since many real objects inherently have input ambiguity as well as
output ambiguity, MIML is more natural and convenient for tasks involving such
objects.

It is worth mentioning that MIML is more reasonable than (single-instance)
multi-label learning. Fig. 2(a) shows a multi-label object which is described by one
instance but associated with $l$ class labels. The underlying task is to
learn a one-to-many mapping, yet one-to-many mappings are not proper mathematical
functions. It seems feasible to regard every possible combination of the
multiple labels as a "meta-label", so that the task becomes learning a one-to-one
mapping between the instances and meta-labels. Such a process, however,
suffers seriously from the fact that the number of meta-labels is exponential
in the number of labels. There may be an insufficient number of training examples
for learning some meta-labels; more seriously, meta-labels corresponding to label
combinations which have not appeared in the training set will have no training
examples at all. Another possibility is to decompose the multi-label learning task into a
series of two-class problems by treating the labels independently; this has actually
been adopted by some multi-label learning algorithms as mentioned in Section 2.
Such a process, however, suffers from the neglect of correlations among the labels.
Note that in multi-label tasks the labels associated with the same instances
usually have informative correlations which can be leveraged for better performance.
In particular, when there are a large number of labels and a small number
of training examples, the correlation information will be important to compensate
for the lack of training data and should not be neglected. Overall, the one-to-many
mapping might be the major difficulty in dealing with ambiguous objects. If we
represent the multi-label object using a set of instances, however, as Fig. 2(b)
illustrates, the underlying task becomes learning a many-to-many mapping, which
is realizable by mathematical functions. So, transforming multi-label examples to
MIML examples for learning may be beneficial in some tasks, which will be shown
in Section 6. Moreover, note that in some cases, understanding why a concerned
object has a certain class label is even more important than simply making an
accurate prediction, and MIML offers a possibility for this purpose. For example,
the object in Fig. 2(b) has label$_1$ because it contains instance$_n$; it has label$_l$ because
it contains instance$_i$; while the co-occurrence of both instance$_1$ and instance$_i$ triggers
label$_j$.

(a) Multi-label learning: one-to-many mapping (b) MIML: many-to-many mapping

Fig. 2. Comparing MIML with multi-label learning. In addition to the difference between
one-to-many mapping and many-to-many mapping, MIML also offers a possibility for
understanding the relationship between instances and labels, e.g., as shown by the red
curves, label$_1$ is caused by instance$_n$, label$_l$ is caused by instance$_i$, while label$_j$ is caused
by the co-occurrence of instance$_1$ and instance$_i$.

(a) Africa is a complicated high-level concept

(b) The concept Africa may become easier to learn through exploiting some sub-concepts

Fig. 3. MIML can be helpful in learning single-label examples involving complicated
high-level concepts
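
The blow-up of the "meta-label" approach discussed above can be checked numerically. The sketch below counts how many label combinations are possible and how many actually occur in a training set; the function name and the toy label sets are assumptions, and the count of $2^l$ includes the empty combination.

```python
from typing import Iterable, List, Set, Tuple


def meta_label_coverage(training_label_sets: List[Set[str]],
                        label_space: Iterable[str]) -> Tuple[int, int]:
    """Return (number of possible meta-labels, number observed in training).
    Every subset of the label space is one candidate meta-label."""
    possible = 2 ** len(list(label_space))
    observed = {frozenset(ys) for ys in training_label_sets}
    return possible, len(observed)


label_space = ["Africa", "grasslands", "lions"]
training = [{"lions", "Africa"}, {"grasslands"}, {"lions", "Africa"}]
possible, observed = meta_label_coverage(training, label_space)
```

With only three labels there are already eight candidate meta-labels, of which this toy training set covers two; the remaining ones have no training example at all.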



MIML can also be helpful for learning single-label examples involving complicated
high-level concepts. For example, as Fig. 3(a) shows, the concept Africa has a
broad connotation and the images belonging to Africa have great variance, thus it
is not easy to classify the top-left image in Fig. 3(a) into the Africa class correctly.
However, if we can exploit some low-level sub-concepts that are less ambiguous
and easier to learn, such as tree, lions, elephant and grassland shown in Fig. 3(b),
it is possible to induce the concept Africa much more easily than by learning the concept
Africa directly. The usefulness of MIML in this process will be shown in Section 7.

Fig. 4. The two general degeneration solutions.

4 Solving MIML Problems by Degeneration 

It is evident that traditional supervised learning is a degenerated version of multi- 
instance learning as well as a degenerated version of multi-label learning, while 
traditional supervised learning, multi-instance learning and multi-label learning 
are all degenerated versions of MIML. So, a simple idea to tackle MIML is to 
identify its equivalence in the traditional supervised learning framework, using 
multi-instance learning or multi-label learning as the bridge, as shown in Fig. 4. 
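
The sense in which the other frameworks are degenerated versions of MIML can be illustrated by embedding their examples as MIML examples whose instance set or label set is a singleton. The helper names below are hypothetical and the sketch only shows the data-side embedding, not any learning algorithm.

```python
from typing import List, Sequence, Set, Tuple

Instance = Sequence[float]
MIML = Tuple[List[Instance], Set[str]]   # (set of instances, set of labels)


def from_supervised(x: Instance, y: str) -> MIML:
    """Single instance, single label: both sets are singletons."""
    return [x], {y}


def from_multi_instance(bag: List[Instance], y: str) -> MIML:
    """Many instances, single label: only the label set is a singleton."""
    return list(bag), {y}


def from_multi_label(x: Instance, labels: Set[str]) -> MIML:
    """Single instance, many labels: only the instance set is a singleton."""
    return [x], set(labels)


sisl = from_supervised([0.2, 0.4], "lions")
mil = from_multi_instance([[0.2, 0.4], [0.9, 0.1]], "lions")
mll = from_multi_label([0.2, 0.4], {"lions", "Africa"})
```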

• Solution A: Using multi-instance learning as the bridge:

The MIML learning task, i.e., to learn a function $f : 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$, can be
transformed into a multi-instance learning task, i.e., to learn a function
$f_{MIL} : 2^{\mathcal{X}} \times \mathcal{Y} \to \{-1, +1\}$. For any $y \in \mathcal{Y}$, $f_{MIL}(X_i, y) = +1$ if $y \in Y_i$ and $-1$
otherwise. The proper labels for a new example $X^*$ can be determined according
to $Y^* = \{y \mid \mathrm{sign}[f_{MIL}(X^*, y)] = +1\}$. This multi-instance learning task
can be further transformed into a traditional supervised learning task, i.e., to
learn a function $f_{SISL} : \mathcal{X} \times \mathcal{Y} \to \{-1, +1\}$, under a constraint specifying
how to derive $f_{MIL}(X_i, y)$ from $f_{SISL}(x_{ij}, y)$ $(j = 1, 2, \ldots, n_i)$. For any $y \in \mathcal{Y}$,
$f_{SISL}(x_{ij}, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. Here the constraint can be
$f_{MIL}(X_i, y) = \mathrm{sign}[\sum_{j=1}^{n_i} f_{SISL}(x_{ij}, y)]$, which has been used by Xu and Frank [70]
in transforming multi-instance learning tasks into traditional supervised learning
tasks. Note that other kinds of constraint can also be used here.

• Solution B: Using multi-label learning as the bridge:

The MIML learning task, i.e., to learn a function $f : 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$, can be
transformed into a multi-label learning task, i.e., to learn a function $f_{MLL} : \mathcal{Z} \to 2^{\mathcal{Y}}$.
For any $z_i \in \mathcal{Z}$, $f_{MLL}(z_i) = f_{MIML}(X_i)$ if $z_i = \phi(X_i)$, $\phi : 2^{\mathcal{X}} \to \mathcal{Z}$. The proper
labels for a new example $X^*$ can be determined according to $Y^* = f_{MLL}(\phi(X^*))$.
This multi-label learning task can be further transformed into a traditional
supervised learning task, i.e., to learn a function $f_{SISL} : \mathcal{Z} \times \mathcal{Y} \to \{-1, +1\}$. For
any $y \in \mathcal{Y}$, $f_{SISL}(z_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. That is, $f_{MLL}(z_i) =
\{y \mid f_{SISL}(z_i, y) = +1\}$. Here the mapping $\phi$ can be implemented with constructive
clustering, which was proposed by Zhou and Zhang [93] for transforming
multi-instance bags into traditional single-instances. Note that other kinds of
mappings can also be used here.
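
The prediction side of the two solutions can be sketched as below, assuming an instance-level (or vector-level) scorer is already available. Everything named here is hypothetical: `toy_f_sisl` is a hand-written stand-in for a learned $f_{SISL}$, and `mean_phi` is a deliberately naive stand-in for the mapping $\phi$ (it is not constructive clustering). Ties in the sign are broken toward $-1$, which is also an assumption.

```python
from typing import Callable, Iterable, List, Sequence, Set

Instance = Sequence[float]
Scorer = Callable[[Instance, str], int]      # returns -1 or +1


def sign(v: float) -> int:
    return 1 if v > 0 else -1                # assumption: zero maps to -1


def predict_solution_a(f_sisl: Scorer, bag: List[Instance],
                       labels: Iterable[str]) -> Set[str]:
    """Solution A: f_MIL(X, y) = sign(sum_j f_SISL(x_j, y)); keep labels with +1."""
    return {y for y in labels if sign(sum(f_sisl(x, y) for x in bag)) == 1}


def predict_solution_b(f_sisl: Scorer, phi: Callable[[List[Instance]], Instance],
                       bag: List[Instance], labels: Iterable[str]) -> Set[str]:
    """Solution B: map the bag to one vector z = phi(X), then
    f_MLL(z) = {y | f_SISL(z, y) = +1}."""
    z = phi(bag)
    return {y for y in labels if f_sisl(z, y) == 1}


# Hypothetical stand-ins used only to run the sketch.
FEATURE_OF = {"lions": 0, "grasslands": 1}


def toy_f_sisl(x: Instance, y: str) -> int:
    return 1 if x[FEATURE_OF[y]] > 0.5 else -1


def mean_phi(bag: List[Instance]) -> List[float]:
    return [sum(x[d] for x in bag) / len(bag) for d in range(len(bag[0]))]


bag = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.9]]
labels_a = predict_solution_a(toy_f_sisl, bag, FEATURE_OF)
labels_b = predict_solution_b(toy_f_sisl, mean_phi, bag, FEATURE_OF)
```

In this toy case both routes return the single label `lions`; with other scorers or mappings the two solutions need not agree.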

In the rest of this section we will propose two MIML algorithms, MimlBoost and
MimlSvm. MimlBoost is an illustration of Solution A, which uses category-wise
decomposition for the A1 step in Fig. 4 and MiBoosting for A2; MimlSvm is an
illustration of Solution B, which uses clustering-based representation transformation
for the B1 step and MlSvm for B2. Other MIML algorithms can be developed
by taking alternative options.

4.1 MimlBoost 

Now we propose the MimlBoost algorithm according to the first solution mentioned
above, that is, identifying the equivalence in the traditional supervised learning
framework using multi-instance learning as the bridge. Note that this strategy
can also be used to derive other kinds of MIML algorithms.

Given any set $\Omega$, let $|\Omega|$ denote its size, i.e., the number of elements in $\Omega$; given
any predicate $\pi$, let $[\![\pi]\!]$ be 1 if $\pi$ holds and 0 otherwise; given $(X_i, Y_i)$, for any
$y \in \mathcal{Y}$, let $\Psi(X_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise, where $\Psi$ is a function
$\Psi : 2^{\mathcal{X}} \times \mathcal{Y} \to \{-1, +1\}$ which judges whether a label $y$ is a proper label of $X_i$ or
not. The pseudo-code of MimlBoost is summarized in Table 1.

Table 1
The MimlBoost algorithm

1 Transform each MIML example $(X_u, Y_u)$ $(u = 1, 2, \ldots, m)$ into $|\mathcal{Y}|$ multi-instance
bags $\{[(X_u, y_1), \Psi(X_u, y_1)], \ldots, [(X_u, y_{|\mathcal{Y}|}), \Psi(X_u, y_{|\mathcal{Y}|})]\}$. Thus, the original
data set is transformed into a multi-instance data set containing $m \times |\mathcal{Y}|$
multi-instance bags, denoted by $\{[(X^{(i)}, y^{(i)}), \Psi(X^{(i)}, y^{(i)})]\}$ $(i = 1, 2, \ldots, m \times |\mathcal{Y}|)$.

2 Initialize the weight of each bag to $W^{(i)} = \frac{1}{m \times |\mathcal{Y}|}$ $(i = 1, 2, \ldots, m \times |\mathcal{Y}|)$.

3 Repeat for $t = 1, 2, \ldots, T$ iterations:

3a Assign the bag's label $\Psi(X^{(i)}, y^{(i)})$ to each of its instances $(x_j^{(i)}, y^{(i)})$
$(i = 1, 2, \ldots, m \times |\mathcal{Y}|;\ j = 1, 2, \ldots, n_i)$, set the weight of the $j$-th instance of the
$i$-th bag to $W_j^{(i)} = W^{(i)} / n_i$, and build an instance-level predictor
$h_t[(x_j^{(i)}, y^{(i)})] \in \{-1, +1\}$.

3b For the $i$-th bag, compute the error rate $e^{(i)} \in [0, 1]$ by counting the number of
misclassified instances within the bag, i.e.,
$e^{(i)} = \frac{\sum_{j=1}^{n_i} [\![ h_t[(x_j^{(i)}, y^{(i)})] \neq \Psi(X^{(i)}, y^{(i)}) ]\!]}{n_i}$.

3c If $e^{(i)} < 0.5$ for all $i \in \{1, 2, \ldots, m \times |\mathcal{Y}|\}$, go to Step 4.

3d Compute $c_t = \arg\min_{c_t} \sum_{i=1}^{m \times |\mathcal{Y}|} W^{(i)} \exp[(2e^{(i)} - 1)c_t]$.

3e If $c_t \le 0$, go to Step 4.

3f Set $W^{(i)} = W^{(i)} \exp[(2e^{(i)} - 1)c_t]$ $(i = 1, 2, \ldots, m \times |\mathcal{Y}|)$ and re-normalize such
that $0 \le W^{(i)} \le 1$ and $\sum_{i=1}^{m \times |\mathcal{Y}|} W^{(i)} = 1$.

4 Return $Y^* = \{y \mid \mathrm{sign}(\sum_j \sum_t c_t h_t[(x_j^*, y)]) = +1\}$ ($x_j^*$ is $X^*$'s $j$-th instance).
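
The bag-level bookkeeping of one boosting round (Steps 3b, 3d and 3f of Table 1) can be sketched as follows. The instance-level predictions are assumed to be given, and the step size is found by a simple ternary search over a bounded interval; both the search method and the interval are choices made for this sketch, since the table does not say how the minimization is carried out.

```python
import math
from typing import List, Sequence


def bag_error_rates(preds: Sequence[Sequence[int]],
                    bag_labels: Sequence[int]) -> List[float]:
    """Step 3b: fraction of misclassified instances within each bag."""
    return [sum(p != g for p in ps) / len(ps)
            for ps, g in zip(preds, bag_labels)]


def best_step_size(weights: Sequence[float], errors: Sequence[float],
                   lo: float = -10.0, hi: float = 10.0,
                   iters: int = 200) -> float:
    """Step 3d: minimise sum_i W_i * exp((2 e_i - 1) c) over c.
    The objective is convex in c, so ternary search on [lo, hi] suffices.
    If every e_i < 0.5 the minimiser is unbounded; Step 3c stops before that."""
    def loss(c: float) -> float:
        return sum(w * math.exp((2 * e - 1) * c)
                   for w, e in zip(weights, errors))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) < loss(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def reweight(weights: Sequence[float], errors: Sequence[float],
             c: float) -> List[float]:
    """Step 3f: scale each bag weight by exp((2 e_i - 1) c), then normalise."""
    new = [w * math.exp((2 * e - 1) * c) for w, e in zip(weights, errors)]
    total = sum(new)
    return [w / total for w in new]


# Two toy bags with four instances each and identical instance predictions.
preds = [[1, 1, 1, -1], [1, 1, 1, -1]]
bag_labels = [1, -1]
weights = [0.75, 0.25]
errors = bag_error_rates(preds, bag_labels)      # [0.25, 0.75]
c_t = best_step_size(weights, errors)
new_weights = reweight(weights, errors, c_t)
```

For this toy input the minimizer has the closed form $c_t = \ln 3$, and the reweighting moves mass from the mostly correct bag to the mostly wrong one until both carry the same weight.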

In the first step of MimlBoost, each MIML example $(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$ is
transformed into a set of $|\mathcal{Y}|$ number of multi-instance bags, i.e., $\{[(X_u, y_1), \Psi(X_u, y_1)],
[(X_u, y_2), \Psi(X_u, y_2)], \cdots, [(X_u, y_{|\mathcal{Y}|}), \Psi(X_u, y_{|\mathcal{Y}|})]\}$. Note that $[(X_u, y_v), \Psi(X_u, y_v)]$
$(v = 1, 2, \cdots, |\mathcal{Y}|)$ is a labeled multi-instance bag where $(X_u, y_v)$ is a bag containing
$n_u$ number of instances, i.e., $\{(x_{u1}, y_v), (x_{u2}, y_v), \cdots, (x_{u,n_u}, y_v)\}$, and
$\Psi(X_u, y_v) \in \{-1, +1\}$ is the label of this bag.

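Thus, the original MIML data set is transformed into a multi-instance data set
containing $m \times |\mathcal{Y}|$ number of bags. We order them as $[(X_1, y_1), \Psi(X_1, y_1)], \cdots,
[(X_1, y_{|\mathcal{Y}|}), \Psi(X_1, y_{|\mathcal{Y}|})], [(X_2, y_1), \Psi(X_2, y_1)], \cdots, [(X_m, y_{|\mathcal{Y}|}), \Psi(X_m, y_{|\mathcal{Y}|})]$, and let
$[(X^{(i)}, y^{(i)}), \Psi(X^{(i)}, y^{(i)})]$ denote the $i$-th of these $m \times |\mathcal{Y}|$ number of bags, which
contains $n_i$ number of instances.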

Then, from the data set a multi-instance learning function $f_{MIL}$ can be learned,
which can accomplish the desired MIML function because $f_{MIML}(X^*) = \{y \mid \mathrm{sign}
[f_{MIL}(X^*, y)] = +1\}$. In this paper, the MiBoosting algorithm [70] is used to
implement $f_{MIL}$. Note that by using MiBoosting, the MimlBoost algorithm
assumes that all instances in a bag contribute independently in an equal way to
the label of that bag.
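The degeneration step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `miml_to_multi_instance` and `predict_label_set`, and the representation of a bag as a list of tuples, are assumptions made here and do not come from the paper.

```python
from typing import Callable, Hashable, List, Sequence, Set, Tuple

Instance = Tuple[float, ...]
Bag = List[Instance]


def miml_to_multi_instance(
    examples: Sequence[Tuple[Bag, Set[Hashable]]],
    label_space: Sequence[Hashable],
) -> List[Tuple[Bag, Hashable, int]]:
    """Expand every MIML example (X_u, Y_u) into |label_space| labeled bags.

    Each item is (X_u, y_v, psi) with psi = +1 if y_v is in Y_u and -1 otherwise.
    The order follows the text: all labels for X_1, then all labels for X_2, ...
    """
    bags = []
    for bag, labels in examples:
        for y in label_space:
            bags.append((bag, y, 1 if y in labels else -1))
    return bags


def predict_label_set(
    f_mil: Callable[[Bag, Hashable], float],
    bag: Bag,
    label_space: Sequence[Hashable],
) -> Set[Hashable]:
    """f_MIML(X*) = {y | sign f_MIL(X*, y) = +1}; f_mil is any trained bag-level scorer."""
    return {y for y in label_space if f_mil(bag, y) > 0}
```

With $m$ examples and $|\mathcal{Y}|$ labels the list has $m \times |\mathcal{Y}|$ entries, which is the size of the multi-instance data set stated above.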

For convenience, let $(B, g)$ denote the bag $[(X, y), \Psi(X, y)]$, $B \in \mathcal{B}$, $g \in \mathcal{G}$, and let $E$
denote the expectation. Then, here the goal is to learn a function $\mathcal{F}(B)$ minimizing
the bag-level exponential loss $E_B E_{G|B}[\exp(-g\mathcal{F}(B))]$, which ultimately estimates
the bag-level log-odds function $\frac{1}{2} \log \frac{\Pr(g=1|B)}{\Pr(g=-1|B)}$ on the training set. In each boosting
round, the aim is to expand $\mathcal{F}(B)$ into $\mathcal{F}(B) + c f(B)$, i.e., adding a new weak
classifier, so that the exponential loss is minimized. Assuming that all instances in
a bag contribute equally and independently to the bag's label, $f(B) = \frac{1}{n_B} \sum_j h(b_j)$
can be derived, where $h(b_j) \in \{-1, +1\}$ is the prediction of the instance-level
classifier $h(\cdot)$ for the $j$-th instance of the bag and $n_B$ is the number of instances
in $B$.

It has been shown by [70] that the best $f(B)$ to be added can be achieved by seeking
$h(\cdot)$ which maximizes $\sum_i \sum_{j=1}^{n_i} [\frac{1}{n_i} W^{(i)} g^{(i)} h(b_j^{(i)})]$, given the bag-level weights
$W = \exp(-g\mathcal{F}(B))$. By assigning each instance the label of its bag and the corresponding
weight $W^{(i)}/n_i$, $h(\cdot)$ can be learned by minimizing the weighted instance-level
classification error. This actually corresponds to the Step 3a of MimlBoost.
When $f(B)$ is found, the best multiplier $c > 0$ can be obtained by directly optimizing
the exponential loss:



$$E_B E_{G|B}[\exp(-g\mathcal{F}(B) + c(-g f(B)))] = \sum_i W^{(i)} \exp\Big[c\Big(-\frac{g^{(i)} \sum_j h(b_j^{(i)})}{n_i}\Big)\Big] = \sum_i W^{(i)} \exp[(2e^{(i)} - 1)c], \tag{1}$$

where $e^{(i)} = \frac{1}{n_i} \sum_j [\![h(b_j^{(i)}) \neq g^{(i)}]\!]$ (computed in Step 3b). Minimization of this
expectation actually corresponds to Step 3d, where numeric optimization techniques
such as quasi-Newton method can be used. Note that in Step 3c if $e^{(i)} \ge 0.5$, the
Boosting process will stop [89]. Finally, the bag-level weights are updated in Step
3f according to the additive structure of $\mathcal{F}(B)$.
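The numeric part of one boosting round (Steps 3b, 3d and 3f) can be illustrated as below. This is a sketch only: the function names are hypothetical, and a golden-section search is substituted for the quasi-Newton method mentioned in the text, which is adequate here because the objective $\sum_i W^{(i)} \exp[(2e^{(i)} - 1)c]$ is convex in $c$. The upper bound `c_max` is an assumption added to keep the search interval finite.

```python
import math
from typing import List, Sequence


def bag_error_rates(instance_preds: Sequence[Sequence[int]], bag_labels: Sequence[int]) -> List[float]:
    """Step 3b: e^(i) is the fraction of instances in bag i whose prediction differs from the bag label."""
    return [sum(1 for h in preds if h != g) / len(preds) for preds, g in zip(instance_preds, bag_labels)]


def exp_loss(c: float, weights: Sequence[float], errors: Sequence[float]) -> float:
    """The right-hand side of Eq. 1 as a function of the multiplier c."""
    return sum(w * math.exp((2 * e - 1) * c) for w, e in zip(weights, errors))


def best_multiplier(weights, errors, c_max: float = 20.0, tol: float = 1e-9) -> float:
    """Step 3d: minimize exp_loss over [0, c_max] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = 0.0, c_max
    x1, x2 = hi - inv_phi * (hi - lo), lo + inv_phi * (hi - lo)
    f1, f2 = exp_loss(x1, weights, errors), exp_loss(x2, weights, errors)
    while hi - lo > tol:
        if f1 < f2:  # the minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = exp_loss(x1, weights, errors)
        else:  # the minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = exp_loss(x2, weights, errors)
    return (lo + hi) / 2


def update_weights(weights, errors, c: float) -> List[float]:
    """Step 3f: reweight each bag and re-normalize so that the weights sum to 1."""
    raw = [w * math.exp((2 * e - 1) * c) for w, e in zip(weights, errors)]
    total = sum(raw)
    return [w / total for w in raw]
```

For two bags with weights 0.75 and 0.25 and error rates 0 and 1, the loss is $0.75e^{-c} + 0.25e^{c}$, whose minimizer is $c = \frac{1}{2}\ln 3$; after the update the poorly classified bag carries the larger weight.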



4.2 MimlSvm 

Now we propose the MimlSvm algorithm according to the second solution men- 
tioned before, that is, identifying the equivalence in the traditional supervised 
learning framework using multi-label learning as the bridge. Note that this strat- 
egy can also be used to derive other kinds of MIML algorithms. 

Again, given any set $\Omega$, let $|\Omega|$ denote its size, i.e., the number of elements in $\Omega$;
given $(X_i, Y_i)$ and $z_i = \phi(X_i)$ where $\phi : 2^{\mathcal{X}} \to \mathcal{Z}$, for any $y \in \mathcal{Y}$, let $\Phi(z_i, y) = +1$ if
$y \in Y_i$ and $-1$ otherwise, where $\Phi$ is a function $\Phi : \mathcal{Z} \times \mathcal{Y} \to \{-1, +1\}$. The pseudo-code
of MimlSvm is summarized in Table 2. This algorithm assumes that the
structure of the bags carries relevant information. It tries to identify representative
training bags by clustering, and then represents every bag by the vector of its
distances to the representatives.

In the first step of MimlSvm, the $X_u$ of each MIML example $(X_u, Y_u)$ $(u =
1, 2, \cdots, m)$ is collected and put into a data set $\Gamma$. Then, in the second step, k-medoids
clustering is performed on $\Gamma$. Since each data item in $\Gamma$, i.e. $X_u$, is an
unlabeled multi-instance bag instead of a single instance, Hausdorff distance [26]
is employed to measure the distance. The Hausdorff distance is a well-known metric
for measuring the distance between two bags of points, which has often been used
in computer vision tasks; other techniques that can measure the distance between
bags of points, such as the set kernel [31], can also be used here. In detail, given
two bags $A = \{a_1, a_2, \cdots, a_{n_A}\}$ and $B = \{b_1, b_2, \cdots, b_{n_B}\}$, the Hausdorff distance
between $A$ and $B$ is defined as

$$d_H(A, B) = \max\Big\{\max_{a \in A} \min_{b \in B} \|a - b\|,\ \max_{b \in B} \min_{a \in A} \|b - a\|\Big\}, \tag{2}$$

where $\|a - b\|$ measures the distance between the instances $a$ and $b$, which takes
the form of Euclidean distance here.

After the clustering process, the data set $\Gamma$ is divided into $k$ partitions, whose
medoids are $M_t$ $(t = 1, 2, \cdots, k)$, respectively. With the help of these medoids,



Table 2

The MimlSvm algorithm

1 For MIML examples $(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$, $\Gamma = \{X_u \mid u = 1, 2, \cdots, m\}$.

2 Randomly select $k$ elements from $\Gamma$ to initialize the medoids $M_t$ $(t = 1, 2, \cdots, k)$, repeat until all $M_t$ do not change:

2a $\Gamma_t = \{M_t\}$ $(t = 1, 2, \cdots, k)$.

2b Repeat for each $X_u \in (\Gamma - \{M_t \mid t = 1, 2, \cdots, k\})$: $index = \arg\min_{t \in \{1, \cdots, k\}} d_H(X_u, M_t)$, $\Gamma_{index} = \Gamma_{index} \cup \{X_u\}$.

2c $M_t = \arg\min_{A \in \Gamma_t} \sum_{B \in \Gamma_t} d_H(A, B)$ $(t = 1, 2, \cdots, k)$.

3 Transform $(X_u, Y_u)$ into a multi-label example $(z_u, Y_u)$ $(u = 1, 2, \cdots, m)$, where $z_u = (z_{u1}, z_{u2}, \cdots, z_{uk}) = (d_H(X_u, M_1), d_H(X_u, M_2), \cdots, d_H(X_u, M_k))$.

4 For each $y \in \mathcal{Y}$, derive a data set $\mathcal{D}_y = \{(z_u, \Phi(z_u, y)) \mid u = 1, 2, \cdots, m\}$, and then train an SVM $h_y = SVMTrain(\mathcal{D}_y)$.

5 Return $Y^* = \{\arg\max_{y \in \mathcal{Y}} h_y(z^*)\} \cup \{y \mid h_y(z^*) > 0, y \in \mathcal{Y}\}$, where $z^* = (d_H(X^*, M_1), d_H(X^*, M_2), \cdots, d_H(X^*, M_k))$.



the original multi-instance example $X_u$ is transformed into a $k$-dimensional numerical
vector $z_u$, where the $i$-th $(i = 1, 2, \cdots, k)$ component of $z_u$ is the distance
between $X_u$ and $M_i$, that is, $d_H(X_u, M_i)$. In other words, $z_{ui}$ encodes some
structure information of the data, that is, the relationship between $X_u$ and the
$i$-th partition of $\Gamma$. This process resembles the constructive clustering process
used by Zhou and Zhang [93] in transforming multi-instance examples into single-instance
examples, except that in [93] the clustering is executed at the instance
level while here it is executed at the bag level. Thus, the original MIML examples
$(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$ have been transformed into multi-label examples
$(z_u, Y_u)$ $(u = 1, 2, \cdots, m)$, which corresponds to the Step 3 of MimlSvm.
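Steps 1 to 3 of MimlSvm (Hausdorff distance, bag-level k-medoids, and the distance-vector representation) can be sketched as follows. The function names, the `seed` and `max_iter` parameters, and the choice to return medoid indices are assumptions made for illustration; the sketch follows the procedure of Table 2 but is not the authors' implementation.

```python
import math
import random
from typing import List, Sequence, Tuple

Instance = Tuple[float, ...]
Bag = Sequence[Instance]


def hausdorff(a: Bag, b: Bag) -> float:
    """Hausdorff distance of Eq. 2, with Euclidean distance between instances."""
    a_to_b = max(min(math.dist(x, y) for y in b) for x in a)
    b_to_a = max(min(math.dist(y, x) for x in a) for y in b)
    return max(a_to_b, b_to_a)


def k_medoids(bags: Sequence[Bag], k: int, seed: int = 0, max_iter: int = 100) -> List[int]:
    """Return the indices of k medoid bags (Steps 2a-2c of Table 2)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(bags)), k)  # random initialization
    for _ in range(max_iter):
        clusters = [[m] for m in medoids]  # Step 2a: each cluster starts with its medoid
        for u in range(len(bags)):  # Step 2b: assign every non-medoid bag
            if u in medoids:
                continue
            nearest = min(range(k), key=lambda t: hausdorff(bags[u], bags[medoids[t]]))
            clusters[nearest].append(u)
        # Step 2c: the new medoid minimizes the summed distance within its cluster
        updated = [min(c, key=lambda a: sum(hausdorff(bags[a], bags[b]) for b in c)) for c in clusters]
        if updated == medoids:
            break
        medoids = updated
    return medoids


def to_distance_vector(bag: Bag, bags: Sequence[Bag], medoids: Sequence[int]) -> List[float]:
    """Step 3: z_u = (d_H(X_u, M_1), ..., d_H(X_u, M_k))."""
    return [hausdorff(bag, bags[m]) for m in medoids]
```

The resulting vectors can be passed to any single-instance multi-label learner; a medoid bag always has a zero in its own coordinate.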

Then, from the data set a multi-label learning function $f_{MLL}$ can be learned, which
can accomplish the desired MIML function because $f_{MIML}(X^*) = f_{MLL}(z^*)$. In
this paper, the MlSvm algorithm [11] is used to implement $f_{MLL}$. Concretely,
MlSvm decomposes the multi-label learning problem into multiple independent
binary classification problems (one per class), where each example associated with
the label set $Y$ is regarded as a positive example when building the SVM for any
class $y \in Y$, and as a negative example when building the SVM for any
class $y \notin Y$, as shown in the Step 4 of MimlSvm. In making predictions, the
T-Criterion [11] is used, which actually corresponds to the Step 5 of the MimlSvm
algorithm. That is, the test example is labeled by all the class labels with positive
SVM scores, except that when all the SVM scores are negative, the test example is
labeled by the class label with the top (least negative) score.
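The one-versus-rest decomposition and the T-Criterion described above are small enough to write out. The helper names below are hypothetical, and the per-label SVM scores are taken as given (any binary classifier that returns a real-valued score could supply them).

```python
from typing import Dict, Hashable, List, Sequence, Set


def binary_targets(label_sets: Sequence[Set[Hashable]], y: Hashable) -> List[int]:
    """Step 4: targets for the binary problem of class y (+1 if y is in Y_u, -1 otherwise)."""
    return [1 if y in labels else -1 for labels in label_sets]


def t_criterion(scores: Dict[Hashable, float]) -> Set[Hashable]:
    """Step 5: the top-scored label plus every label with a positive score."""
    top = max(scores, key=scores.get)
    return {top} | {y for y, s in scores.items() if s > 0}
```

The union with the top-scored label guarantees a non-empty prediction even when every score is negative.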

4.3 Experiments

4.3.1 Multi-Label Evaluation Criteria

In traditional supervised learning where each object has only one class label, ac- 
curacy is often used as the performance evaluation criterion. Typically, accuracy 
is defined as the percentage of test examples that are correctly classified. When 
learning with ambiguous objects associated with multiple labels simultaneously, 
however, accuracy becomes less meaningful. For example, if approach A missed 
one proper label while approach B missed four proper labels for a test example 
having five labels, it is obvious that A is better than B, but the accuracy of A and 
B may be identical because both of them incorrectly classified the test example. 

Five criteria are often used for evaluating the performance of learning with multi-label
examples [55,92]; they are hamming loss, one-error, coverage, ranking loss and
average precision. Using the same notation as that in Sections 3 and 4, given
a test set $S = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_p, Y_p)\}$, these five criteria are defined as
below. Here, $h(X_i)$ returns a set of proper labels of $X_i$; $h(X_i, y)$ returns a real value
indicating the confidence for $y$ to be a proper label of $X_i$; $rank^h(X_i, y)$ returns the
rank of $y$ derived from $h(X_i, y)$.

• $\mathrm{hloss}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|\mathcal{Y}|} |h(X_i) \Delta Y_i|$, where $\Delta$ stands for the symmetric difference
between two sets. The hamming loss evaluates how many times an object-label
pair is misclassified, i.e., a proper label is missed or a wrong label is predicted.
The performance is perfect when $\mathrm{hloss}_S(h) = 0$; the smaller the value of
$\mathrm{hloss}_S(h)$, the better the performance of $h$.

• $\mathrm{one\text{-}error}_S(h) = \frac{1}{p} \sum_{i=1}^{p} [\![[\arg\max_{y \in \mathcal{Y}} h(X_i, y)] \notin Y_i]\!]$. The one-error evaluates how
many times the top-ranked label is not a proper label of the object. The performance
is perfect when $\mathrm{one\text{-}error}_S(h) = 0$; the smaller the value of $\mathrm{one\text{-}error}_S(h)$,
the better the performance of $h$.

• $\mathrm{coverage}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} rank^h(X_i, y) - 1$. The coverage evaluates how far
it is needed, on the average, to go down the list of labels in order to cover all
the proper labels of the object. It is loosely related to precision at the level of
perfect recall. The smaller the value of $\mathrm{coverage}_S(h)$, the better the performance
of $h$.

• $\mathrm{rloss}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i||\bar{Y}_i|} |\{(y_1, y_2) \mid h(X_i, y_1) \le h(X_i, y_2), (y_1, y_2) \in Y_i \times \bar{Y}_i\}|$,
where $\bar{Y}_i$ denotes the complementary set of $Y_i$ in $\mathcal{Y}$. The ranking loss evaluates
the average fraction of label pairs that are misordered for the object. The performance
is perfect when $\mathrm{rloss}_S(h) = 0$; the smaller the value of $\mathrm{rloss}_S(h)$, the
better the performance of $h$.

• $\mathrm{avgprec}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|\{y' \mid rank^h(X_i, y') \le rank^h(X_i, y),\ y' \in Y_i\}|}{rank^h(X_i, y)}$. The average precision
evaluates the average fraction of labels ranked above a particular label
$y \in Y_i$. The performance is perfect when $\mathrm{avgprec}_S(h) = 1$; the larger the value
of $\mathrm{avgprec}_S(h)$, the better the performance of $h$.

In addition to the above criteria, we design two new multi-label criteria, average
recall and average F1, as below.

• $\mathrm{avgrecl}_S(h) = \frac{1}{p} \sum_{i=1}^{p} \frac{|\{y \mid rank^h(X_i, y) \le |h(X_i)|,\ y \in Y_i\}|}{|Y_i|}$. The average recall evaluates the
average fraction of proper labels that have been predicted. The performance is
perfect when $\mathrm{avgrecl}_S(h) = 1$; the larger the value of $\mathrm{avgrecl}_S(h)$, the better the
performance of $h$.

• $\mathrm{avgF1}_S(h) = \frac{2 \times \mathrm{avgprec}_S(h) \times \mathrm{avgrecl}_S(h)}{\mathrm{avgprec}_S(h) + \mathrm{avgrecl}_S(h)}$. The average F1 expresses a tradeoff between
the average precision and the average recall. The performance is perfect
when $\mathrm{avgF1}_S(h) = 1$; the larger the value of $\mathrm{avgF1}_S(h)$, the better the performance
of $h$.



Note that since the above seven criteria measure the performance from different
aspects, it is usually difficult for one algorithm to outperform another on all
these criteria.
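The seven criteria can be computed together from per-example confidence scores, predicted label sets and true label sets. The sketch below is illustrative: the function name `evaluate`, the dictionary-based interface, the tie-breaking of ranks by sort order, and the assumption that every true label set is non-empty are choices made here, not details taken from the paper.

```python
from typing import Dict, Hashable, Sequence, Set


def ranks(scores: Dict[Hashable, float]) -> Dict[Hashable, int]:
    """rank^h(X_i, y): 1 for the highest score; ties are broken by sort order (an assumption)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {y: r for r, y in enumerate(ordered, start=1)}


def evaluate(
    scores_list: Sequence[Dict[Hashable, float]],
    pred_sets: Sequence[Set[Hashable]],
    true_sets: Sequence[Set[Hashable]],
    label_space: Sequence[Hashable],
) -> Dict[str, float]:
    p, n_labels = len(true_sets), len(label_space)
    hl = oe = cov = rl = ap = ar = 0.0
    for scores, pred, true in zip(scores_list, pred_sets, true_sets):
        rk = ranks(scores)
        hl += len(pred ^ true) / n_labels  # symmetric difference
        oe += max(scores, key=scores.get) not in true  # top-ranked label is wrong
        cov += max(rk[y] for y in true) - 1
        comp = set(label_space) - true
        if comp:  # ranking loss is undefined when every label is proper
            bad = sum(1 for y1 in true for y2 in comp if scores[y1] <= scores[y2])
            rl += bad / (len(true) * len(comp))
        ap += sum(sum(1 for y2 in true if rk[y2] <= rk[y]) / rk[y] for y in true) / len(true)
        ar += sum(1 for y in true if rk[y] <= len(pred)) / len(true)
    ap, ar = ap / p, ar / p
    f1 = 2 * ap * ar / (ap + ar) if ap + ar > 0 else 0.0
    return {"hloss": hl / p, "one-error": oe / p, "coverage": cov / p,
            "rloss": rl / p, "avgprec": ap, "avgrecl": ar, "avgF1": f1}
```

For a single example with scores a: 0.9, b: 0.2, c: -0.5, predicted set {a, b} and true set {a, c}, the hamming loss is 2/3, the coverage is 2, the ranking loss is 0.5, the average precision is 5/6, the average recall is 1/2 and the average F1 is 5/8.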



4.3.2 Scene Classification 

The scene classification data set consists of 2,000 natural scene images belonging
to the classes desert, mountains, sea, sunset and trees. Over 22% of the images belong
to multiple classes simultaneously. Each image has already been represented as a
bag of nine instances generated by the SBN method [45], which uses a Gaussian
filter to smooth the image and then subsamples the image to an 8 × 8 matrix of
color blobs where each blob is a 2 × 2 set of pixels within the matrix. An instance
corresponds to the combination of a single blob with its four neighboring blobs
(up, down, left, right), which is described with 15 features. The first three features
represent the mean R, G, B values of the central blob and the remaining twelve
features express the differences in mean color values between the central blob and
the other four neighboring blobs, respectively.

We evaluate the performances of the MIML algorithms MimlBoost and MimlSvm.
Note that MimlBoost and MimlSvm are proposed only to illustrate the two general
degeneration solutions to MIML problems shown in Fig. 4, and we do not
claim that they are the best algorithms that can be developed through the degeneration
paths. There may exist other processes for transforming MIML examples
into multi-instance single-label (MISL) examples or single-instance multi-label
(SIML) examples. Even by using the same degeneration process as that used in
MimlBoost and MimlSvm, there are also many alternatives to realize the second
step. For example, by using Mi-Svm [3] to replace the MiBoosting used in
MimlBoost, we get MimlSvm$_{mi}$, which is also evaluated in our experiments.

We compare the MIML algorithms with several state-of-the-art algorithms for
learning with multi-label examples, including AdtBoost.MH [22], RankSvm
[27], MlSvm [11] and Ml-kNN [80]; these algorithms have been introduced briefly
in Section 2. Note that these are single-instance algorithms that regard each image



^ The data set is available at http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/miml-image-data.htm



Table 3

Results (mean±std.) on scene classification ('↓' indicates 'the smaller the better'; '↑'
indicates 'the larger the better')

Compared       Evaluation Criteria
Algorithms     hloss ↓      one-error ↓  coverage ↓   rloss ↓      aveprec ↑    averecl ↑    aveF1 ↑
MimlBoost      .193±.007    .347±.019    .984±.049    .178±.011    .779±.012    .433±.027    .556±.023
MimlSvm        .189±.009    .354±.022    1.087±.047   .201±.011    .765±.013    .556±.020    .644±.018
MimlSvm_mi     .195±.008    .317±.018    1.068±.052   .197±.011    .783±.011    .587±.019    .671±.015
AdtBoost.MH    .211±.006    .436±.019    1.223±.050   N/A          .718±.012    N/A          N/A
RankSvm        .219±.020    .400±.063    1.177±.163   .225±.041    .739±.041    .516±.048    .608±.045
MlSvm          .232±.004    .447±.023    1.217±.054   .233±.012    .712±.013    .073±.010    .132±.017
Ml-kNN         .191±.006    .370±.017    1.085±.048   .203±.010    .759±.011    .407±.026    .529±.023



as a 135-dimensional feature vector, which is obtained by concatenating the nine
instances in the direction from upper-left to lower-right.

The best-performing parameters reported in [27], [11] and [80] are used for RankSvm,
MlSvm and Ml-kNN, respectively. The boosting rounds of AdtBoost.MH and
MimlBoost are set to 25 and 50, respectively; it can be observed from Fig. 5 that
at those rounds the performances of the algorithms have become stable. Gaussian
kernel Libsvm [16] is used for the Step 3a of MimlBoost. MimlSvm and
MimlSvm$_{mi}$ are also realized with Gaussian kernels. The parameter $k$ of MimlSvm
is set to be 20% of the number of training images; it can be observed from Fig. 6 that
the setting of $k$ does not significantly affect the performance of MimlSvm. Note
that in Figs. 5 and 6 we plot 1 − average precision, 1 − average recall and 1 − average
F1 such that in all the figures, the lower the curve, the better the performance.

Here in the experiments, 1,500 images are used as training examples while the
remaining 500 images are used for testing. Experiments are repeated for thirty runs
by using random training/test partitions, and the average and standard deviation
are summarized in Table 3, where the best performance on each criterion has
been highlighted in boldface.



Pairwise t-tests with 95% significance level disclose that all the MIML algorithms
are significantly better than AdtBoost.MH and MlSvm on all the seven
evaluation criteria. This is impressive since as mentioned before, these evaluation



^ Ranking loss, average recall and average F1 are not available for AdtBoost.MH.

Fig. 5. Performances of MimlBoost and AdtBoost.MH at different rounds.

(a) hamming loss (b) one-error (c) coverage (d) ranking loss (e) 1 − average precision (f) 1 − average recall (g) 1 − average F1

Fig. 6. Performances of MimlSvm with different k values.



criteria measure the learning performance from different aspects and one algorithm
rarely outperforms another algorithm on all criteria. Both MimlSvm and
MimlSvm$_{mi}$ are significantly better than RankSvm on all the evaluation criteria.
MimlBoost is significantly better than RankSvm on the first five criteria. Both
MimlBoost and MimlSvm$_{mi}$ are significantly better than Ml-kNN on all criteria
except hamming loss. MimlSvm is significantly better than Ml-kNN on one-error,
average precision, average recall and average F1, while there are ties on the other
criteria. Moreover, note that the best performances on all evaluation criteria are
always attained by MIML algorithms. Overall, comparison on the scene classification
task shows that the MIML algorithms can be significantly better than the
non-MIML algorithms; this supports the effectiveness of the MIML framework.



4.3.3 Text Categorization



The Reuters-21578 data set is used in the experiment. The seven most frequent
categories are considered. After removing documents that do not have labels or
main texts, and randomly removing some documents that have only one label,
a data set containing 2,000 documents is obtained, where over 14.9% of the documents
have multiple labels. Each document is represented as a bag of instances according
to the method used in [3]. Briefly, the instances are obtained by splitting each
document into passages using overlapping windows of at most 50 words each. As
a result, there are 2,000 bags and the number of instances in each bag varies from
2 to 26 (3.6 on average). The instances are represented based on term frequency.
The words with high frequencies are considered, excluding "function words" that
have been removed from the vocabulary using the SMART stop-list [54]. It has
been found that based on term frequency, the dimensionality of the data set can
be reduced to 1-10% without loss of effectiveness [73]. Thus, we use the top 2%
frequent words, and therefore each instance is a 243-dimensional feature vector.

The compared algorithms are the same as those in Section 4.3.2. Linear kernels are
used. The single-instance algorithms regard each document as a 243-dimensional
feature vector which is obtained by aggregating all the instances in the same bag;
this is equivalent to representing the document using a sole term frequency feature
vector.

Here in the experiments, 1,500 documents are used as training examples while
the remaining 500 documents are used for testing. Experiments are repeated for
thirty runs by using random training/test partitions, and the average and standard
deviation are summarized in Table 4, where the best performance on each criterion
has been highlighted in boldface.

Pairwise t-tests with 95% significance level disclose that, impressively, both MimlSvm
and MimlSvm$_{mi}$ are significantly better than all the non-MIML algorithms. MimlBoost
is significantly better than AdtBoost.MH on all criteria except that there
is a tie on hamming loss; significantly better than RankSvm on all criteria; significantly
better than MlSvm on average recall and there is a tie on average F1;



^ The data set is available at http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/miml-text-data.htm



Table 4

Results (mean±std.) on text categorization ('↓' indicates 'the smaller the better'; '↑'
indicates 'the larger the better'). For each compared algorithm, the values are listed in
the column order hloss ↓, one-error ↓, coverage ↓, rloss ↓, aveprec ↑, averecl ↑, aveF1 ↑.


MimlBoost 


.053±.001 


.094±.014 


.387±.037 


.035±.005 


.937±.008 


.792±.010 


.858±.008 


MimlSvm 


.033±.003 


.066±.011 


.313±.035 


.023±.004 


.956±.006 


.925±.010 


.940±.008 


MimlSvm_mi


.041±.004


.055±.009 


.284±.030 


.020±.003 


.965±.005 


.921±.012 


.942±.007 


AdtBoost.MH 


.055±.005 


.120±.017 


.409±.047 


N/A 


.926±.011 


N/A 


N/A 


RankSvm 


.157±.031 


.228±.192 


.801±.780 


.104±.128 


.846±.145 


.267±.131 


.393±.164 


MlSvm 


.050±.003 


.081±.011


.329±.029 


.026±.003 


.949±.006 


.777±.016 


.854±.011 


Ml-kNN


.049±.003 


.126±.012 


.440±.035 


.045±.004 


.920±.007 


.821±.021 


.867±.013 



significantly better than Ml-kNN on one-error, coverage, ranking loss and average
precision. Moreover, note that the best performances on all evaluation criteria
are always attained by MIML algorithms. Overall, comparison on the text categorization
task shows that the MIML algorithms are better than the non-MIML
algorithms; this supports the effectiveness of the MIML framework.



5 Solving MIML Problems by Regularization

The degeneration methods presented in Section 4 may lose information during the
degeneration process, and thus a "direct" MIML algorithm is desirable. In this
section we propose a regularization method for MIML. This method considers the
loss between the labels and the predictions on the bags as well as on the constituent
instances. It also considers the relatedness between the labels associated with the same
example. Moreover, considering that for any class label the number of positive
examples is much smaller than that of negative examples, this method incorporates
a mechanism to deal with class imbalance. We employ the constrained concave-convex
procedure (CCCP), which has well-studied convergence properties [61], to
solve the resultant non-linear optimization problem. We also present a cutting
plane algorithm that finds the solution efficiently. In contrast to MimlSvm, this
method is developed from the regularization framework directly and so we call it
D-MimlSvm.



5.1 The Loss Function 



Suppose $f_t(X_i)$ can judge whether the $t$-th element of $\mathcal{Y}$ is a proper label of
$X_i$ or not. By learning such a function for every element in $\mathcal{Y}$, i.e., by learning
$f = (f_1, f_2, \cdots, f_T)$, the proper labels of an unseen bag $X^*$ can be determined.
Here $T$ is the number of labels in $\mathcal{Y}$. It is worth noting that each instance $x_{ij}$ in a
bag can be viewed as a bag $\{x_{ij}\}$ containing only one instance, and so $f(\{x_{ij}\})$ is
also a well-defined function for each instance $x_{ij}$. For convenience, we write $f(x_{ij})$
for $f(\{x_{ij}\})$.

In order to learn $f_t$, it is necessary to consider the relationship between the bag $X_i$ and
its instances $\{x_{i1}, x_{i2}, \cdots, x_{i,n_i}\}$. In multi-instance learning, it is usually assumed
that the strength for $X_i$ to hold a label is equal to the maximum strength for its
instances to hold the label, i.e., $f_t(X_i) = \max_{j=1,\cdots,n_i} f_t(x_{ij})$. The above requirement,
however, is usually too restrictive. One reason is that there are many cases where
the label of the bag does not only rely on the instance with the maximum prediction,
as discussed in Section 2; another reason is that, in classification only the sign
of prediction is important [19], i.e., $\mathrm{sign}(f_t(X_i)) = \mathrm{sign}(\max_{j=1,\cdots,n_i} f_t(x_{ij}))$. Thus, we
introduce the general loss function $V$ for MIML setting as shown in Eq. 3.

$$V(\{X_i\}_{i=1}^{m}, \{Y_i\}_{i=1}^{m}, f) = \frac{1}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} (1 - y_{it} f_t(X_i))_+ + \frac{\lambda}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} l\Big(f_t(X_i), \max_{j=1,\cdots,n_i} f_t(x_{ij})\Big) \tag{3}$$



The loss function $V$ involves two parts balanced by $\lambda$. The first part considers the
loss between the bag $X_i$'s labels and its corresponding predictions $f(X_i)$, such as
the hinge loss $(1 - y_{it} f_t(X_i))_+$ where $(z)_+ = \max(0, z)$. The second part considers
the loss between $f(X_i)$ and the predictions of $X_i$'s constituent instances $\{f(x_{ij})\}$,
such as $l(f_t(X_i), \max_{j=1,\cdots,n_i} f_t(x_{ij}))$. Here $l(v_1, v_2)$ can be defined in various ways,
such as the $l_1$ loss

$$l(v_1, v_2) = |v_1 - v_2|. \tag{4}$$
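To make the two parts of the loss concrete, the sketch below evaluates it for given predictions, using the hinge loss for the bag-level part and the $l_1$ loss of Eq. 4 for the bag-versus-instances part, both normalized by $mT$. The function name and the nested-list layout of the inputs are assumptions made for illustration.

```python
from typing import Sequence


def miml_loss(
    bag_preds: Sequence[Sequence[float]],              # bag_preds[i][t] = f_t(X_i)
    inst_preds: Sequence[Sequence[Sequence[float]]],   # inst_preds[i][j][t] = f_t(x_ij)
    y: Sequence[Sequence[int]],                        # y[i][t] in {-1, +1}
    lam: float,                                        # the balance parameter lambda
) -> float:
    m, T = len(bag_preds), len(bag_preds[0])
    hinge = 0.0
    consistency = 0.0
    for i in range(m):
        for t in range(T):
            # first part: hinge loss between the bag label and the bag prediction
            hinge += max(0.0, 1.0 - y[i][t] * bag_preds[i][t])
            # second part: l1 loss between the bag prediction and the best instance prediction
            best_instance = max(inst[t] for inst in inst_preds[i])
            consistency += abs(bag_preds[i][t] - best_instance)
    return hinge / (m * T) + lam * consistency / (m * T)
```

Setting `lam` to zero recovers a purely bag-level hinge loss, while a large `lam` pushes each bag prediction towards the maximum over its instances.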



5.2 Representer Theorem for MIML 



For simplicity, we assume that each function $f_t$ is a linear model, i.e., $f_t(x) =
\langle w_t, \phi(x) \rangle$ where $\phi$ is the feature map induced by a kernel function $k$ and $\langle \cdot, \cdot \rangle$ denotes
the standard inner product in the Reproducing Kernel Hilbert Space (RKHS)
$\mathcal{H}$ induced by the kernel $k$. Recall that an instance can be regarded as a bag containing
only one instance, so the kernel $k$ can be any kernel defined on sets of instances,
such as the set kernel [31]. In the case of classification, objects (bags or instances)
are classified according to the sign of $f_t$.

It is evident that the labels associated with a bag should have some relatedness;
otherwise they should not be associated with the bag simultaneously. Inspired
by [28], we assume that all the $w_t$'s come from a particular Gaussian distribution,
with the mean denoted as $w_0$,

$$w_0 = \frac{1}{T} \sum_{t=1}^{T} w_t. \tag{5}$$

The original idea in [28] is to minimize $\sum_{t=1}^{T} \|w_t - w_0\|^2$ and meanwhile minimize
$\|w_0\|^2$, which is a bit complicated. Note that, according to Eq. 5 we have

$$\sum_{t=1}^{T} \|w_t - w_0\|^2 = \sum_{t=1}^{T} \|w_t\|^2 - T\|w_0\|^2. \tag{6}$$

Therefore, minimizing $\sum_{t=1}^{T} \|w_t - w_0\|^2$ and meanwhile minimizing $\|w_0\|^2$ can be
simplified by minimizing $\sum_{t=1}^{T} \|w_t\|^2$ and $\|w_0\|^2$ simultaneously.
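The identity in Eq. 6 follows from expanding the square and substituting the definition of the mean in Eq. 5, namely $\sum_{t=1}^{T} w_t = T w_0$. The intermediate steps, which are not spelled out in the text, are:

```latex
\begin{aligned}
\sum_{t=1}^{T} \|w_t - w_0\|^2
  &= \sum_{t=1}^{T} \|w_t\|^2 - 2\Big\langle \sum_{t=1}^{T} w_t,\; w_0 \Big\rangle + T\|w_0\|^2 \\
  &= \sum_{t=1}^{T} \|w_t\|^2 - 2T\|w_0\|^2 + T\|w_0\|^2 \\
  &= \sum_{t=1}^{T} \|w_t\|^2 - T\|w_0\|^2 .
\end{aligned}
```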

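Further note that $\sum_{t=1}^{T} \|w_t\|^2 = \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2$ and $\|w_0\|^2 = \|\frac{\sum_{t=1}^{T} f_t}{T}\|_{\mathcal{H}}^2$; using Eq. 3
and $w_0$ in Eq. 5, we have the regularized framework for MIML in Eq. 7,

$$\min_{f \in \mathcal{H}} \frac{1}{T} \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2 + \mu \Big\|\frac{\sum_{t=1}^{T} f_t}{T}\Big\|_{\mathcal{H}}^2 + \gamma V(\{X_i\}_{i=1}^{m}, \{Y_i\}_{i=1}^{m}, f), \tag{7}$$

where $\gamma$ is a regularization parameter that balances the model complexity and the
empirical risk, and $\mu$ is a parameter to trade off the discrepancy and commonness
among the labels. Intuitively, when $\mu$ is large, the commonness among the labels
is more important, and vice versa.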

Given the above setup, we can prove the following representer theorem. 



Theorem 1 The minimizer of the optimization problem Eq. 7 admits an expansion

$$f_t(x) = \sum_{i=1}^{m} \Big(\alpha_{t,i0} k(x, X_i) + \sum_{j=1}^{n_i} \alpha_{t,ij} k(x, x_{ij})\Big)$$

where all $\alpha_{t,i0}, \alpha_{t,ij} \in \mathbb{R}$.

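Proof. Analogous to [28], we first introduce a combined feature map

$$\Psi(x, t) = \Big(\frac{\phi(x)}{\sqrt{r}}, \underbrace{0, \cdots, 0}_{t-1}, \phi(x), \underbrace{0, \cdots, 0}_{T-t}\Big)$$

and its decision function, i.e., $\hat{f}(x, t) = \langle \hat{w}, \Psi(x, t) \rangle$ where

$$\hat{w} = (\sqrt{r} w_0, w_1 - w_0, \cdots, w_T - w_0).$$

Here $r = \mu T + T$. Let $\hat{k}$ denote the kernel function induced by $\Psi$ and $\hat{\mathcal{H}}$ its
corresponding RKHS. We have Eqs. 8 and 9.

$$\hat{f}(x, t) = \langle \hat{w}, \Psi(x, t) \rangle = \langle (w_0 + w_t - w_0), \phi(x) \rangle = \langle w_t, \phi(x) \rangle = f_t(x) \tag{8}$$

$$\|\hat{f}\|_{\hat{\mathcal{H}}}^2 = \|\hat{w}\|^2 = \sum_{t=1}^{T} \|w_t - w_0\|^2 + r\|w_0\|^2 = \sum_{t=1}^{T} \|w_t\|^2 + \mu T\|w_0\|^2 \tag{9}$$

Therefore, the loss function in Eq. 3 can be represented by $\hat{V}(\{X_i\}_{i=1}^{m}, \{Y_i\}_{i=1}^{m}, \hat{f})$, i.e.,

$$\hat{V}(\{X_i\}_{i=1}^{m}, \{Y_i\}_{i=1}^{m}, \hat{f}) = \frac{1}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} (1 - y_{it} \hat{f}(X_i, t))_+ + \frac{\lambda}{mT} \sum_{i=1}^{m} \sum_{t=1}^{T} l\Big(\hat{f}(X_i, t), \max_{j=1,\cdots,n_i} \hat{f}(x_{ij}, t)\Big). \tag{10}$$

Thus, Eq. 7 is equivalent to

$$\min_{\hat{f} \in \hat{\mathcal{H}}} \frac{1}{T} \|\hat{f}\|_{\hat{\mathcal{H}}}^2 + \gamma \hat{V}(\{X_i\}_{i=1}^{m}, \{Y_i\}_{i=1}^{m}, \hat{f}). \tag{11}$$

Note that $\Omega(\|\hat{f}\|_{\hat{\mathcal{H}}}^2) = \frac{1}{T}\|\hat{f}\|_{\hat{\mathcal{H}}}^2 : [0, \infty) \to \mathbb{R}$ is a strictly monotonically increasing
function. According to the representer theorem (Theorem 4.2 in [56]), each minimizer
$\hat{f}$ of the functional risk in Eq. 11 admits a representation of the form

$$\hat{f}(x, t) = \sum_{t'=1}^{T} \sum_{i=1}^{m} \Big(\beta_{t',i0} \hat{k}((X_i, t'), (x, t)) + \sum_{j=1}^{n_i} \beta_{t',ij} \hat{k}((x_{ij}, t'), (x, t))\Big), \tag{12}$$

where $\beta_{t,ij} \in \mathbb{R}$ and the corresponding weight vector $\hat{w}$ is represented as

$$\hat{w} = \sum_{t=1}^{T} \sum_{i=1}^{m} \Big(\beta_{t,i0} \Psi(X_i, t) + \sum_{j=1}^{n_i} \beta_{t,ij} \Psi(x_{ij}, t)\Big). \tag{13}$$

Finally, with Eqs. 8 and 13, we have

$$f_t(x) = \langle w_t, \phi(x) \rangle = \langle \hat{w}, \Psi(x, t) \rangle = \sum_{i=1}^{m} \Big(\alpha_{t,i0} k(x, X_i) + \sum_{j=1}^{n_i} \alpha_{t,ij} k(x, x_{ij})\Big) \tag{14}$$

where $\alpha_{t,ij} = \frac{1}{r}(\sum_{t'} \beta_{t',ij}) + \beta_{t,ij}$. □

Note that $x$ in Eq. 14 can be regarded not only as a bag $X_i$ but also as an instance
$x_{ij}$. In other words, both $f_t(X_i)$ and $f_t(x_{ij})$ can be obtained by Eq. 14.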

5.3 Optimization 

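Considering the use of $l_1$ loss for $l(v_1, v_2)$, Eq. 7 can be re-written as

$$\begin{aligned}
\min_{f \in \mathcal{H}, \xi, \delta}\ & \frac{1}{T} \sum_{t=1}^{T} \|f_t\|_{\mathcal{H}}^2 + \mu \Big\|\frac{\sum_{t=1}^{T} f_t}{T}\Big\|_{\mathcal{H}}^2 + \frac{\gamma}{mT} \xi' \mathbf{1} + \frac{\gamma\lambda}{mT} \delta' \mathbf{1} \\
\text{s.t.}\ & y_{it} f_t(X_i) \ge 1 - \xi_{it}, \\
& \xi \ge \mathbf{0}, \\
& -\delta_{it} \le f_t(X_i) - \max_{j=1,\cdots,n_i} f_t(x_{ij}) \le \delta_{it}
\end{aligned} \tag{15}$$

where $\xi = [\xi_{11}, \xi_{12}, \cdots, \xi_{it}, \cdots, \xi_{mT}]'$ are slack variables for the errors on the training
bags for each label, $\delta = [\delta_{11}, \delta_{12}, \cdots, \delta_{it}, \cdots, \delta_{mT}]'$, and $\mathbf{0}$ and $\mathbf{1}$ are all-zero
and all-one vectors, respectively.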

Without loss of generality, assume that the bags and instances are ordered in the
order $(X_1, \cdots, X_m, x_{11}, \cdots, x_{1,n_1}, \cdots, x_{m,1}, \cdots, x_{m,n_m})$. Thus, each object (bag
or instance) in the training set can then be indexed by the following function $\mathcal{I}$,
i.e.,

$$\mathcal{I}(X_i) = i, \qquad \mathcal{I}(x_{ij}) = m + \sum_{l=1}^{i-1} n_l + j$$

for $j = 1, \cdots, n_i$ and $i = 1, \cdots, m$. With this ordering, we can obtain the $(m +
n) \times (m + n)$ kernel matrix $K$ defined on all objects in the training set, where
$n = \sum_{i=1}^{m} n_i$. Denote the $i$-th column of $K$ by $k_i$. We have $f_t(X_i) = k'_{\mathcal{I}(X_i)} \alpha_t + b_t$
and $f_t(x_{ij}) = k'_{\mathcal{I}(x_{ij})} \alpha_t + b_t$. Here, the bias $b_t$ for each label is included.

The weight vector corresponding to $f_t$ can be re-written as $w_t = \Phi \alpha_t$, where $\Phi$
is the matrix with all the mapped bags and instances stacked as columns, thus
$\Phi'\Phi = K$.
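The index function $\mathcal{I}$ simply lays out all bags first and then all instances bag by bag. A small sketch, using the 1-based indices of the text and a hypothetical helper name `build_index`, shows how a position in the $(m + n) \times (m + n)$ kernel matrix is obtained from the bag sizes $n_1, \cdots, n_m$.

```python
from typing import Callable, Sequence, Tuple


def build_index(bag_sizes: Sequence[int]) -> Tuple[Callable[[int], int], Callable[[int, int], int]]:
    """Return (bag_index, instance_index) for the ordering X_1..X_m, x_11..x_{m,n_m}.

    All indices are 1-based, as in the text.
    """
    m = len(bag_sizes)
    # offsets[i - 1] = n_1 + ... + n_{i-1}, the number of instances before bag i
    offsets = [0]
    for size in bag_sizes[:-1]:
        offsets.append(offsets[-1] + size)

    def bag_index(i: int) -> int:
        return i

    def instance_index(i: int, j: int) -> int:
        return m + offsets[i - 1] + j

    return bag_index, instance_index
```

With bag sizes (2, 3), the two bags take indices 1 and 2 and the five instances take indices 3 to 7, so every object maps to a distinct row and column of $K$.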

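According to the definition of $f_t$ in Eq. 14, Eq. 15 can be cast as the optimization
problem

$$\begin{aligned}
\min_{A, \xi, \delta, b}\ & \frac{1}{T} \sum_{t=1}^{T} \alpha_t' K \alpha_t + \frac{\mu}{T^2} \mathbf{1}' A' K A \mathbf{1} + \frac{\gamma}{mT} \xi' \mathbf{1} + \frac{\gamma\lambda}{mT} \delta' \mathbf{1} \\
\text{s.t.}\ & y_{it}(k'_{\mathcal{I}(X_i)} \alpha_t + b_t) \ge 1 - \xi_{it}, \\
& \xi \ge \mathbf{0}, \\
& k'_{\mathcal{I}(x_{ij})} \alpha_t - \delta_{it} \le k'_{\mathcal{I}(X_i)} \alpha_t, \\
& k'_{\mathcal{I}(X_i)} \alpha_t - \max_{j=1,\cdots,n_i} k'_{\mathcal{I}(x_{ij})} \alpha_t \le \delta_{it},
\end{aligned} \tag{16}$$

where $A = [\alpha_1, \alpha_2, \cdots, \alpha_T]$ and $b = [b_1, b_2, \cdots, b_T]'$.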

The above optimization problem is a non-convex optimization problem since the
last constraint is non-convex. Note that this non-convex constraint is a difference
between two convex functions, and thus the optimization problem can be solved
by CCCP [19,61], which is one of the most standard techniques for solving this kind
of non-convex optimization problem. CCCP is guaranteed to converge to a local
minimum [75], and in many cases it can even converge to a global solution [25].

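In particular, CCCP needs to solve a sequence of convex quadratic problems. Suppose
given the initial subgradient $\sum_{j=1}^{n_i} \rho_{ijt} k'_{\mathcal{I}(x_{ij})} \alpha_t$ of $\max_{j=1,\cdots,n_i} k'_{\mathcal{I}(x_{ij})} \alpha_t$, then we
solve the convex quadratic optimization problem

$$\begin{aligned}
\min_{A, \xi, \delta, b}\ & \frac{1}{T} \sum_{t=1}^{T} \alpha_t' K \alpha_t + \frac{\mu}{T^2} \mathbf{1}' A' K A \mathbf{1} + \frac{\gamma}{mT} \xi' \mathbf{1} + \frac{\gamma\lambda}{mT} \delta' \mathbf{1} \\
\text{s.t.}\ & y_{it}(k'_{\mathcal{I}(X_i)} \alpha_t + b_t) \ge 1 - \xi_{it}, \\
& \xi \ge \mathbf{0}, \\
& k'_{\mathcal{I}(x_{ij})} \alpha_t - \delta_{it} \le k'_{\mathcal{I}(X_i)} \alpha_t, \\
& k'_{\mathcal{I}(X_i)} \alpha_t - \sum_{j=1}^{n_i} \rho_{ijt} k'_{\mathcal{I}(x_{ij})} \alpha_t \le \delta_{it}.
\end{aligned} \tag{17}$$

In the next iteration, we update $\rho_{ijt}$ according to

$$\rho_{ijt} = \begin{cases} 0, & \text{if } k'_{\mathcal{I}(x_{ij})} \alpha_t \neq \max_{j=1,\cdots,n_i} (k'_{\mathcal{I}(x_{ij})} \alpha_t), \\ 1/n_d, & \text{otherwise,} \end{cases}$$

where $n_d$ is the number of active $x_{ij}$'s. It holds that $\sum_{j=1}^{n_i} \rho_{ijt} = 1$ for any $t$. This
procedure is guaranteed to converge to a local minimum.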
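The update of the subgradient coefficients is the only non-standard step in each CCCP iteration, so a short sketch may help. Given the current instance-level values $k'_{\mathcal{I}(x_{ij})} \alpha_t$ for one bag and one label, the weight is split evenly among the instances attaining the maximum. The function name and the numerical tolerance `tol` used to detect ties are assumptions added here.

```python
from typing import List, Sequence


def update_rho(instance_values: Sequence[float], tol: float = 1e-12) -> List[float]:
    """Return rho_ij for one bag and one label.

    rho_ij = 1 / n_d for the n_d instances attaining the maximum, and 0 otherwise.
    """
    top = max(instance_values)
    active = [abs(v - top) <= tol for v in instance_values]
    n_d = sum(active)
    return [1.0 / n_d if is_active else 0.0 for is_active in active]
```

Because the coefficients sum to one and are supported on the maximizers, the linearized term $\sum_j \rho_{ijt} k'_{\mathcal{I}(x_{ij})} \alpha_t$ equals the maximum at the current solution, which is what makes the replaced constraint tight at that point.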



5.4 Handling Class-Imbalance

The above solution may be improved further if we explicitly take into account the 

instance-level class-imbalance, that is, for any class label the number of positive 
instances is much fewer than the number of negative instances in MIML problems. 

We can roughly estimate the imbalance rate, which is the ratio of the number of
positive instances to that of negative instances, for each class label using the strategy
adopted by [40]. In detail, for a specific label $y \in \mathcal{Y}$, we can divide the training
bags $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_m, Y_m)\}$ into two subsets, $A_1 = \{(X_i, Y_i) \mid y \in Y_i\}$
and $A_2 = \{(X_i, Y_i) \mid y \notin Y_i\}$. It is obvious that all the instances in $A_2$ are negative
to $y$. Then, for every $(X_i, Y_i)$ in $A_1$, assuming that the instances of different labels
are roughly equally distributed, the number of positive instances of $y$ in $(X_i, Y_i)$ is
roughly $n_i \times \frac{1}{|Y_i|}$, where $|Y_i|$ returns the number of labels in $Y_i$. Thus, the imbalance
rate of $y$ is:

$$ibr(y) = \frac{1}{m} \sum_{i=1}^{m} [\![y \in Y_i]\!] \times n_i \times \frac{1}{|Y_i|} \times \frac{1}{n_i} = \frac{1}{m} \sum_{i=1}^{m} \frac{[\![y \in Y_i]\!]}{|Y_i|}.$$

There are many class-imbalance learning methods [69] . One of the most popular and 
effective methods is rescaling [87], which can be incorporated into our framework 
easily. In short, after obtaining the estimated imbalance rate for every class label, 
we can use these rates to modulate the loss caused by different misclassifications. 
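The estimate can be illustrated with a few lines of Python. The sketch below assumes the per-bag form of the estimate, in which a bag containing label $y$ contributes the fraction $1/|Y_i|$ of its instances as positive and the fractions are averaged over the $m$ training bags; the function name is hypothetical, and this normalization is an assumption of the sketch rather than something the code can confirm about the original formula.

```python
from typing import Hashable, Sequence, Set


def imbalance_rate(label_sets: Sequence[Set[Hashable]], y: Hashable) -> float:
    """Estimated fraction of positive instances for label y, averaged over bags.

    A bag whose label set contains y is assumed to have a share 1 / |Y_i| of
    positive instances; bags without y contribute no positive instances.
    """
    m = len(label_sets)
    return sum(1.0 / len(labels) for labels in label_sets if y in labels) / m
```

For four bags with label sets {a, b}, {a}, {b, c, d} and {c}, the estimate for label a is (1/2 + 1) / 4 = 0.375, while a label that never occurs gets 0.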

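In detail, $\xi$ in Eq. 17 is directly related to the hinge loss $(1 - y_{it} f_t(X_i))_+$. According
to the rescaling method [87], without loss of generality, we can rewrite the loss
function into Eq. 18.

$$\Big(\frac{1 + y_{it}}{2} - y_{it} \times ibr(y_{it})\Big) (1 - y_{it} f_t(X_i))_+ . \tag{18}$$

Let $\tau = [\tau_{11}, \tau_{12}, \cdots, \tau_{it}, \cdots, \tau_{mT}]'$, where $\tau_{it} = \frac{1 + y_{it}}{2} - y_{it} \times ibr(y_{it})$. Then, to
minimize the loss defined in Eq. 18, Eq. 17 becomes Eq. 19. Here $\xi'\tau$ indicates the
weighted loss after considering the instance-level class-imbalance. It is evident that
the problem in Eq. 19 is still a standard QP problem.

$$\begin{aligned}
\min_{A, \xi, \delta, b}\ & \frac{1}{T} \sum_{t=1}^{T} \alpha_t' K \alpha_t + \frac{\mu}{T^2} \mathbf{1}' A' K A \mathbf{1} + \frac{\gamma}{mT} \xi' \tau + \frac{\gamma\lambda}{mT} \delta' \mathbf{1} \\
\text{s.t.}\ & y_{it}(k'_{\mathcal{I}(X_i)} \alpha_t + b_t) \ge 1 - \xi_{it}, \\
& \xi \ge \mathbf{0}, \\
& k'_{\mathcal{I}(x_{ij})} \alpha_t - \delta_{it} \le k'_{\mathcal{I}(X_i)} \alpha_t, \\
& k'_{\mathcal{I}(X_i)} \alpha_t - \sum_{j=1}^{n_i} \rho_{ijt} k'_{\mathcal{I}(x_{ij})} \alpha_t \le \delta_{it}.
\end{aligned} \tag{19}$$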

5.5 Efficient Algorithm 

Eq. 19 is a large-scale quadratic programming problem which involves lots of con- 
straints and variables. To make it tractable and scalable, and on observing that 
most of the constraints in Eq. 19 are redundant, we present an efficient algorithm 
which constructs a nested sequence of tighter relaxations of the original problem 
using the cutting plane method [39]. 

Similar to its use with structured prediction [64], we add a constraint (or a cut)
that is most violated by the current solution, and then find the solution in the
updated feasible region. Such a procedure will converge to an optimal (or $\epsilon$-suboptimal)
solution of the original problem. Moreover, Eq. 19 supports a natural problem
decomposition since its constraint matrix is a block diagonal matrix, i.e., each block
corresponds to one label.

The pseudo-code of the algorithm is shown in Table 5. We first initialize the working
sets $S_t$'s as empty sets and the solutions as all zeros (Line 1). Then, instead
of testing all the constraints, which is rather expensive when there are lots of
constraints, we use the speedup heuristic as described in [60], i.e., we use $p$
constraints to approximate the whole set of constraints (Line 4). Smola and Schölkopf [60]
have shown that when $p$ is larger than 59, the most violated constraint in $I$ is with
probability 0.95 among the 5% most violated constraints among all constraints.
The $Loss_i$ (Line 5) is calculated as $\max\{0, u^\top x - d\}$ where $u$ and $d$ are the linear



Table 5

Efficient Algorithm for Eq. 19

Input: $K$, $\lambda$, $\mu$, $\gamma$, $\epsilon$, $\{(X_i, Y_i)\}_{i=1}^{m}$

1   $\forall t$, $S_t = \emptyset$, $v_t = (\alpha_t, \xi_{t1}, \cdots, \xi_{tm}, \delta_{t1}, \cdots, \delta_{tm}, b_t) = 0$
2   Repeat
3     For $t = 1, \cdots, T$
4       Pick $p$ indexes of constraints that are not in $S_t$ randomly, denoted by $I$;
5       Compute $Loss_i$ for every constraint in $I$;
6       % find out the cutting plane
7       $q = \arg\max_{i \in I} Loss_i$
8       If $Loss_q > \epsilon$
9         $S_t = S_t \cup \{q\}$;
10        $v_t$ optimized over $S_t$;
11      End If
12    End For
13  Until no $S_t$ changes



coefficients and bias of the i-th linear constraint, respectively. If the maximal Loss 
is lower than the given stopping criteria e (we simply set e as 10~^ in our exper- 
iments), no update will be taken for the working set Sf, otherwise the constraint 
with the maximal Loss will be added into $S_t$ (Lines 8 and 9). Once a new constraint
is added, the solution will be re-computed with respect to $S_t$ via solving a smaller
quadratic programming problem (Line 10). The algorithm stops when there is no update
for all $S_t$'s.
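The working-set loop of Table 5 can be illustrated on a deliberately tiny problem. The sketch below is not the solver used for Eq. 19: it assumes a one-dimensional objective $(x - c)^2$ with linear constraints $a_i x \le d_i$, so that re-solving over the working set reduces to clamping $c$ to an interval, and it starts from the unconstrained minimizer rather than from zeros. All names are hypothetical.

```python
import random


def loss(a, d, x):
    """Violation of the constraint a * x <= d, i.e. max{0, a*x - d}."""
    return max(0.0, a * x - d)


def solve_on_working_set(c, constraints, working):
    """Minimize (x - c)^2 subject only to the constraints in the working set."""
    lo, hi = float("-inf"), float("inf")
    for i in working:
        a, d = constraints[i]
        if a > 0:
            hi = min(hi, d / a)   # a*x <= d  ->  x <= d/a
        else:
            lo = max(lo, d / a)   # a*x <= d with a < 0  ->  x >= d/a
    return min(max(c, lo), hi)


def cutting_plane(c, constraints, p=59, eps=1e-6, seed=0):
    rng = random.Random(seed)
    working = set()
    x = c                          # solution with an empty working set
    changed = True
    while changed:
        changed = False
        rest = [i for i in range(len(constraints)) if i not in working]
        if not rest:
            break
        sample = rng.sample(rest, min(p, len(rest)))           # Line 4
        q = max(sample, key=lambda i: loss(*constraints[i], x))  # Lines 5-7
        if loss(*constraints[q], x) > eps:                      # Line 8
            working.add(q)                                      # Line 9
            x = solve_on_working_set(c, constraints, working)   # Line 10
            changed = True
    return x, working


# x <= 3, x <= 2, x >= -1, x <= 5; the unconstrained minimizer is c = 4.
constraints = [(1.0, 3.0), (1.0, 2.0), (-1.0, 1.0), (1.0, 5.0)]
x_opt, used = cutting_plane(4.0, constraints)
```

Only the single binding constraint ends up in the working set here, which is the point of the method: redundant constraints are never passed to the solver.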

5.6 Experiments

In this section, we compare the two SVM algorithms for MIML, that is, the "direct"
algorithm D-MimlSvm and the "indirect" algorithm MimlSvm, on the scene 
classification data set used in Section 4.3.2 and the text categorization data set 
used in Section 4.3.3. 

To study the behavior of D-MimlSvm and MimlSvm under different amounts
of multi-label data, we derive five data sets from the scene data. By randomly
removing some single-label images, we obtain a data set where 30% (or 40%, or
50%) of the images belong to multiple classes simultaneously; by randomly removing
some multi-label images, we obtain a data set where 10% (or 20%) of the images belong
to multiple classes simultaneously. A similar process is applied to the text data
to derive five data sets. On the derived data sets we use 25% of the data for training
and the remaining 75% for testing, and experiments are repeated for thirty
runs with random training/test partitions. The parameters of both MimlSvm and
D-MimlSvm are set by hold-out tests on training sets. Since D-MimlSvm needs
to solve a large optimization problem, although we have incorporated advanced
mechanisms such as the cutting-plane algorithm, the current D-MimlSvm can only
deal with training sets of moderate size.

Fig. 7. Results on scene classification with different percentages of multi-label data. The
lower the curve, the better the performance.

The seven criteria introduced in Section 4.3.1 are used to evaluate the performance.
The average and standard deviation are plotted in Figs. 7 and 8. Note that in the
figures we plot 1 - average precision, 1 - average recall and 1 - average F1 such that
in all the figures, the lower the curve, the better the performance.

From Figs. 7 and 8 we can find that the performance of D-MimlSvm is apparently
better than that of MimlSvm. The results suggest that D-MimlSvm is a good
choice for learning with a moderate number of MIML examples.

Fig. 8. Results on text categorization with different percentages of multi-label data. The
lower the curve, the better the performance.



5.7 Discussion

The regularization framework presented in this section has an important
assumption, that is, all the class labels share some commonness, i.e., the $w_0$ in Eq. 5. This
assumption makes the regularization easier to realize; however, it over-simplifies
the real scenario. In fact, in real applications it is rare that all class labels share
some commonness; it is more typical that some class labels share some
commonness, but the commonness shared by different labels may be different. For example,
class label $y_1$ may share something with class label $y_2$, and $y_2$ may share something
with $y_3$, but maybe $y_1$ shares nothing with $y_3$. So, a more reasonable assumption
is that different pairs of labels share different things (or even nothing). By taking
this assumption, a more powerful method may be developed.

Actually, it is not difficult to modify the framework of Eq. 7 by replacing the role
of $w_0$ by $W$ whose element $w_{ij}$ expresses the relatedness between the $i$-th and
$j$-th class labels, that is,

min^E II - W-., r +^E/^^. II W^^J II' +7^ ■ (20) 

id i,j 



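Note that $W$ is a tensor and $w_{ij}$ is a vector.

To minimize Eq. 20, taking the derivative with respect to $w_{ij}$, we have

$$-(w_i - w_{ij}) - (w_j - w_{ji}) + 2\mu_{ij} w_{ij} + 2\mu_{ji} w_{ji} = 0 .$$

Considering $w_{ij} = w_{ji}$ and $\mu_{ij} = \mu_{ji}$, we have

$$-(w_i - w_{ij}) - (w_j - w_{ij}) + 4\mu_{ij} w_{ij} = 0 ,$$

and so,

$$w_{ij} = \frac{w_i + w_j}{4\mu_{ij} + 2} . \qquad (21)$$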
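As a quick sanity check of this closed form, the following sketch evaluates the left-hand side of the simplified stationarity condition at $w_{ij} = (w_i + w_j)/(4\mu_{ij} + 2)$ and confirms that it vanishes. The function names and the numeric values are illustrative assumptions only.

```python
def pairwise_vector(w_i, w_j, mu_ij):
    """Closed-form w_ij = (w_i + w_j) / (4*mu_ij + 2), applied element-wise."""
    return [(a + b) / (4 * mu_ij + 2) for a, b in zip(w_i, w_j)]


def stationarity_residual(w_i, w_j, w_ij, mu_ij):
    """Left-hand side -(w_i - w_ij) - (w_j - w_ij) + 4*mu_ij*w_ij, element-wise."""
    return [-(a - c) - (b - c) + 4 * mu_ij * c
            for a, b, c in zip(w_i, w_j, w_ij)]


w_i, w_j, mu_ij = [1.0, 2.0], [3.0, -2.0], 0.5
w_ij = pairwise_vector(w_i, w_j, mu_ij)
residual = stationarity_residual(w_i, w_j, w_ij, mu_ij)
```

A larger $\mu_{ij}$ shrinks $w_{ij}$ towards zero, i.e., the two labels are treated as sharing less.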

Put Eq. 21 into Eq. 20, we have 
After simplification, Eq. 20 becomes



1 / 16/x| + lO/i^j + 1 2/Xjj + 1 



1 ^ 2//jj- + 1 \ , 



4T^ {2piij + 1)2 
So, the new optimization task becomes 



^ ^ ^O,, . . _L 1 ^2 * ^T^^ 



s.t. yit{kj^Xi)<^t + bt)>l- 
I >0, 

fcx(Xi)"t - max k'j^^,.)CXt < 5^. 

By solving Eq. 23 we can get not only a MIML learner, but also some understanding
of the relatedness between pairs of labels from the $w_{ij}$'s, and some understanding of
the different importance of the $w_{ij}$'s in determining the concerned class label
from the $\mu_{ij}$'s; this may be very helpful for understanding the ambiguous concepts
underlying the task. Eq. 23, however, is difficult to solve since it involves too
many variables. Thus, how to exploit/understand the pairwise relatedness between
different pairs of labels remains an open problem.

6 Solving Single-Instance Multi-Label Problems through MIML Transformation

The previous sections show that when we have access to the raw objects and are
able to represent ambiguous objects as MIML examples, using the MIML
framework is beneficial. However, in many practical tasks we are given observational data
where each ambiguous object has already been represented by a single instance,
and we do not have access to the raw objects. In such a case, we cannot capture
more information from the raw objects using the MIML representation. Even in
this situation, however, MIML is still useful. Here we propose the InsDif (i.e.,
INStance DIFferentiation) algorithm which transforms single-instance multi-label
examples into MIML examples to exploit the power of MIML.

6.1 InsDif

For an object associated with multiple class labels, if it is described by only a
single instance, the information corresponding to these labels is mixed and thus
difficult to learn; if we can transform the single instance into a set of instances in
some proper way, the mixed information might be detached to some extent and
thus become less difficult to learn. This is the motivation of InsDif.

InsDif is a two-stage algorithm, which is based on instance differentiation. In the 
first stage, InsDif transforms each example into a bag of instances, by deriving 
one instance for each class label, in order to explicitly express the ambiguity of the 
example in the input space; in the second stage, an MIML learner is utilized to learn 
from the transformed data set. For the consistency with our previous description of 
the algorithm [85], in the current version of InsDif we use a two-level classification 
strategy, but note that other MIML algorithms such as D-MimlSvm can also be 
applied. 

We use the same notation as in Sections 3 and 4, that is, we are given a data set
$S = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_m, Y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $Y_i \subseteq \mathcal{Y}$
a set of labels $\{y_{i1}, y_{i2}, \cdots, y_{i,l_i}\}$, $y_{ik} \in \mathcal{Y}$ $(k = 1, 2, \cdots, l_i)$. Here $l_i$ denotes the
number of labels in $Y_i$. For the ease of discussion, assume that the number of
possible labels $|\mathcal{Y}| = T$, and the dimensionality of $x_i$ $(i = 1, 2, \cdots, m)$ is $d$.

In the first stage, InsDif derives a prototype vector $v_l$ for each class label $l \in \mathcal{Y}$
by averaging all the training instances belonging to $l$, i.e.,

$$v_l = \frac{1}{|S_l|} \sum_{x_i \in S_l} x_i , \qquad (24)$$

where

$$S_l = \{x_i \mid (x_i, Y_i) \in S,\ l \in Y_i\}, \quad l \in \mathcal{Y} .$$

Here $v_l$ can be approximately regarded as a profile-style vector describing common
characteristics of the class $l$. Actually, prototype vectors of this kind have already
shown their usefulness in solving text categorization problems. For example, the
Rocchio method [33,58] forms a prototype vector for each class by averaging all
the documents (represented by weight vectors) of this class, and then classifies the
test document by calculating the dot-products between the weight vector
representing the document and each of the prototype vectors. Here we use such
prototype vectors to facilitate bag generation. After obtaining the prototype
vectors, each example $x_i$ is re-represented by a bag of instances $B_i$, where each
instance in $B_i$ expresses the difference between $x_i$ and a prototype vector according
to Eq. 25. In this way, each example is transformed into a bag whose size equals
the number of class labels.

$$B_i = \{x_i - v_l \mid l \in \mathcal{Y}\} \qquad (25)$$

In fact, such a process attempts to exploit the spatial distribution since $x_i - v_l$
in Eq. 25 is a kind of distance between $x_i$ and $v_l$. The transformation can also be
realized in other ways. For example, other than referring to the prototype vector
of each class, we can also proceed in the following way. For each possible class
$l$, identify the $k$-nearest neighbors of $x_i$ among training instances that have $l$ as
a proper label. Then, the mean vector of these neighbors can be regarded as an
instance in the bag. Note that the transformation of a single instance into a bag of
instances can be realized as a general pre-processing method which can be plugged
into many learning systems.
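The first stage (Eqs. 24 and 25) amounts to a class-wise mean followed by a vector difference. A minimal sketch, assuming instances are plain lists of floats, label sets are Python sets, and every label has at least one training instance; the function names are hypothetical.

```python
def prototype_vectors(S, labels):
    """Eq. 24: average all training instances that carry each label.

    S is a list of (x, Y) pairs; every label is assumed to occur at least once.
    """
    protos = {}
    for l in labels:
        members = [x for x, Y in S if l in Y]
        d = len(members[0])
        protos[l] = [sum(x[k] for x in members) / len(members) for k in range(d)]
    return protos


def to_bag(x, protos):
    """Eq. 25: one instance x - v_l per class label l."""
    return [[xk - vk for xk, vk in zip(x, v)] for v in protos.values()]


S = [([1.0, 0.0], {"a"}), ([3.0, 2.0], {"a", "b"}), ([5.0, 4.0], {"b"})]
protos = prototype_vectors(S, ["a", "b"])
bag = to_bag([1.0, 0.0], protos)
```

The same `to_bag` call is what a test example $x^*$ would go through before prediction, since the prototypes are fixed after training.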




Fig. 9. The two-level classification structure used by InsDif 

In the second stage, InsDif learns from the transformed training set $S^* = \{(B_1, Y_1),
(B_2, Y_2), \cdots, (B_m, Y_m)\}$. This task is accomplished by the two-level classification
structure shown in Fig. 9. Input to the structure is a bag $B$ consisting of $n$
instances $\{b_1, b_2, \cdots, b_n\}$, where each instance $b_i$ is a $d$-dimensional feature vector
$[b_{i1}, b_{i2}, \cdots, b_{id}]^\top$. Outputs of the structure consist of $T$ real values $\{y_1, y_2, \cdots, y_T\}$,
where each output $y_l$ corresponds to a label $l \in \mathcal{Y}$. The first level is composed of
$M$ bags $\{C_1, C_2, \cdots, C_M\}$, where each bag $C_j$ is the medoid of group $G_j$. Here
$\{G_1, G_2, \cdots, G_M\}$ partition the transformed training set $S^*$ into disjoint groups
of bags with $\bigcup_{j=1}^{M} G_j = \{B_1, B_2, \cdots, B_m\}$ and $G_i \cap G_j = \emptyset$ for $i \neq j$. The second level
weights $W = [w_{jl}]_{M \times T}$ connect each medoid $C_j$ in the first level to each output $y_l$.

By regarding each bag as an atomic object, we adapt the popular $k$-medoids
algorithm to cluster $S^*$ into $M$ disjoint groups of bags. Here, we employ the
Hausdorff distance [26] shown in Eq. 2 to measure the distance between bags. For
categorical data, a distance metric such as the Value Difference Metric (Vdm) [62]
can be used. After this process, $S^*$ is divided into $M$ partitions and the medoids
$C_j$ $(j = 1, 2, \cdots, M)$ are

$$C_j = \arg\min_{A \in G_j} \sum_{B \in G_j} d_H(A, B) . \qquad (26)$$
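The medoid selection of Eq. 26 can be sketched as follows. The code assumes that Eq. 2 is the standard (maximal) Hausdorff distance with Euclidean distance between instances; if Eq. 2 defines a variant, only `hausdorff` would change. Function names are hypothetical.

```python
import math


def hausdorff(A, B):
    """Maximal Hausdorff distance between two bags of points (assumed form of Eq. 2)."""
    def directed(P, Q):
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))


def medoid(group):
    """Eq. 26: the bag with the smallest summed distance to all bags in its group."""
    return min(group, key=lambda A: sum(hausdorff(A, B) for B in group))


group = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 0.0)]]
center = medoid(group)
```

Because the medoid is itself a bag from the group, no averaging of bags with different sizes is ever required.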

Since clustering can help to find the underlying structure of a data set, the medoid
of each group may encode some distributional information of different bags. With
the help of these medoids, each bag $B$ can be converted into an $M$-dimensional
feature vector $[\phi_1(B), \phi_2(B), \cdots, \phi_M(B)]^\top$ with $\phi_j(B) = d_H(B, C_j)$. The second
level weights $W = [w_{jl}]_{M \times T}$ are optimized by minimizing the following
sum-of-squares error function

$$E = \frac{1}{2} \sum_{i=1}^{m} \sum_{l=1}^{T} \left( y_l(B_i) - d_l^i \right)^2 , \qquad (27)$$

where $y_l(B_i) = \sum_{j=1}^{M} w_{jl} \phi_j(B_i)$ is the actual output of the structure on $B_i$ on the
class $l$; $d_l^i$ is the desired output of $B_i$ on the class $l$, which takes the value of +1
if $l \in Y_i$ and -1 otherwise. Differentiating the objective function in Eq. 27 with
respect to $w_{jl}$ and setting the derivative to zero gives the normal equations for the
least-squares problem as

$$(\Phi^\top \Phi) W = \Phi^\top T , \qquad (28)$$

where $\Phi = [\phi_{ij}]_{m \times M}$ is with elements $\phi_{ij} = \phi_j(B_i)$ and $T = [t_{il}]_{m \times T}$ is with
elements $t_{il} = d_l^i$. Here we compute the second layer weights $W$ by solving Eq. 28
using singular value decomposition.
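To make Eq. 28 concrete, the sketch below forms $\Phi^\top \Phi$ and $\Phi^\top T$ and solves the system. Note the substitution: it uses Gauss-Jordan elimination with partial pivoting so that it runs with the standard library alone, whereas the text solves Eq. 28 by singular value decomposition, which also copes with a singular $\Phi^\top \Phi$. Function and variable names are hypothetical.

```python
def solve_normal_equations(Phi, Tgt):
    """Solve (Phi^T Phi) W = Phi^T Tgt for W (M x T) by Gauss-Jordan elimination."""
    m, M, T = len(Phi), len(Phi[0]), len(Tgt[0])
    A = [[sum(Phi[i][a] * Phi[i][b] for i in range(m)) for b in range(M)]
         for a in range(M)]
    R = [[sum(Phi[i][a] * Tgt[i][l] for i in range(m)) for l in range(T)]
         for a in range(M)]
    aug = [A[a] + R[a] for a in range(M)]          # augmented matrix [A | R]
    for col in range(M):
        piv = max(range(col, M), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("Phi^T Phi is singular; an SVD-based solver is needed")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(M):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [[aug[a][M + l] / aug[a][a] for l in range(T)] for a in range(M)]


def predict(W, phi_row):
    """Outputs y_l(B) = sum_j w_jl * phi_j(B) for one bag's feature vector."""
    return [sum(phi_row[j] * W[j][l] for j in range(len(W)))
            for l in range(len(W[0]))]


# Three bags, two medoids (columns of Phi), two labels (columns of Tgt).
Phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Tgt = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
W = solve_normal_equations(Phi, Tgt)
```

A label $l$ would then be predicted for a bag whenever the corresponding entry of `predict(W, phi_row)` is positive.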

The pseudo-code of InsDif is summarized in Table 6. In the first stage (Steps 1 to
2), InsDif transforms each example into a bag of instances by querying the class
prototype vectors. In the second stage (Steps 3 to 5), the two-level classification
structure is used to learn from the transformed data. Note that the two-level
classification structure is actually a MIML learner, and any MIML algorithm can be
used to realize the second stage of InsDif. A test example $x^*$ is transformed into
the corresponding bag representation $B^*$ and then fed to the learned classification
structure for prediction.

6.2 Experiments 

We compare InsDif with several state-of-the-art multi-label learning algorithms,
including AdtBoost.MH [22], RankSvm [27], MlSvm [11], Ml-knn [80] and
Cnmf [42]; these algorithms have been introduced briefly in Section 2.

Note that the experiments here are very different from those in Sections 4.3 and
5.6. In Sections 4.3 and 5.6, it is assumed that the data are MIML examples;
while in this section, it is assumed that we are given observational data where each
raw object has already been represented as a single instance. In other words, in


Table 6

The InsDif algorithm

1  For single-instance multi-label examples $(x_u, Y_u)$ $(u = 1, 2, \cdots, m)$, compute the
   prototype vectors $v_l$ $(l \in \mathcal{Y})$ using Eq. 24.
2  Derive the new training set $S^*$ by transforming each $x_i$ into a bag of instances $B_i$
   using Eq. 25.
3  Divide $\{B_1, B_2, \cdots, B_m\}$ into $M$ partitions using the $k$-medoids algorithm employing
   Hausdorff distance.
4  Determine the medoids $C_j$ $(j = 1, 2, \cdots, M)$ using Eq. 26.
5  Compute the weights $W$ by solving Eq. 28 using singular value decomposition.
6  Return $Y^* = \{l \mid y_l(B^*) = \sum_{j=1}^{M} w_{jl} \phi_j(B^*) > 0,\ l \in \mathcal{Y}\}$, where $B^* = \{x^* - v_l \mid l \in \mathcal{Y}\}$.



this section we are trying to learn from single-instance multi-label examples, and
therefore the experimental data sets are different from those used in Sections 4.3
and 5.6.



6.2.1 Yeast Gene Functional Analysis 

The task here is to predict the gene functional classes of the Yeast Saccharomyces
cerevisiae, which is one of the best studied organisms. Specifically, the Yeast data
set investigated in [27,80] is studied. Each gene is represented by a 103-dimensional
feature vector generated by concatenating a gene expression vector and the
corresponding phylogenetic profile. Each 79-element gene expression vector reflects the
expression levels of a particular gene under two different experimental conditions,
while the phylogenetic profile is a Boolean string, with each bit indicating whether the
concerned gene has a close homolog in the corresponding genome. Each gene is
associated with a set of functional classes whose maximum size can be potentially
more than 190. Elisseeff and Weston [27] have preprocessed the data set where only
the known structure of the functional classes is used. Actually, the whole set of
functional classes is structured into hierarchies up to 4 levels deep. The first level
of the hierarchy is shown in Fig. 10. The resulting multi-label data set contains
2,417 genes and fourteen possible class labels, and the average number of labels for each
gene is 4.24 ± 1.57.

See http://mips.gsf.de/proj/yeast/catalogues/funcat/ for more details.

Fig. 10. The first level of the hierarchy of the Yeast gene functional classes. One gene,
for instance the one named YAL062w, can belong to several classes (shaded in gray) of
the fourteen possible classes.

For InsDif, the parameter $M$ is set to be 20% of the size of the training set; it can be
found from Fig. 11 that the performance of InsDif is not sensitive to the setting
of $M$. The number of boosting rounds of AdtBoost.MH is set to 50 according to
Section 4.3.2. For RankSvm, MlSvm, Ml-knn and Cnmf, the best-performing
parameters reported in [27], [11], [80] and [42] are used, respectively. The criteria
introduced in Section 4.3.1 are used to evaluate the learning performance. Ten-fold
cross-validation is conducted on this data set and the results are summarized in
Table 7, where the best performance on each criterion has been highlighted in
boldface.

Table 7 shows that InsDif performs quite well on all evaluation criteria. Pairwise 
t-tests with 95% significance level disclose that InsDif is significantly better than 



Hamming loss, average recall and average F1 are not available for Cnmf; ranking loss,
average recall and average F1 are not available for AdtBoost.MH.



Table 7

Results (mean±std.) on yeast gene data set ('↓' indicates 'the smaller the better'; '↑'
indicates 'the larger the better')

Compared       Evaluation Criteria
Algorithms     hloss ↓     one-error ↓   coverage ↓     rloss ↓     aveprec ↑   avgrecl ↑   avgF1 ↑
InsDif         .189±.010   .214±.030     6.288±0.240    .163±.017   .774±.019   .602±.026   .677±.023
AdtBoost.MH    .207±.010   .244±.035     6.390±0.203    N/A         .744±.025   N/A         N/A
RankSvm        .207±.013   .243±.039     7.090±0.503    .195±.021   .749±.026   .500±.047   .600±.041
MlSvm          .199±.009   .227±.032     7.220±0.338    .201±.019   .749±.021   .572±.023   .649±.022
Ml-knn         .194±.010   .230±.030     6.275±0.240    .167±.016   .765±.021   .574±.022   .656±.021
Cnmf           N/A         .354±.184     7.930±1.089    .268±.062   .668±.093   N/A         N/A

Fig. 11. Performances of InsDif with different M settings.



all the compared algorithms on all criteria, except that on coverage it is worse
than Ml-knn but the difference is not statistically significant. It is noteworthy that
Cnmf performs quite poorly compared to other algorithms although it has used test
set information. The reason may be that the key assumption of Cnmf, i.e., two
examples with high similarity in the input space tend to have large overlap in the
output space, does not hold on this gene data since there are some genes whose
functions are quite different but whose physical appearances are similar.

Overall, results on the Yeast gene functional analysis task suggest that MIML can 
be useful when we are given observational data where each ambiguous object has 
already been represented by a single instance. 



Table 8

Characteristics of the web page data sets (after term selection). PMC denotes the
percentage of documents belonging to more than one category; ANL denotes the average
number of labels for each document; PRC denotes the percentage of rare categories, i.e.,
the kind of category where less than 1% of the instances in the data set belong to it.

                     Number of    Vocabulary    Training Set                  Test Set
Data Set             Categories   Size          PMC      ANL     PRC          PMC      ANL          PRC
Arts&Humanities      26           [illegible]   44.50%   1.627   19.23%       43.63%   [illegible]  19.23%
Business&Economy     30           438           42.20%   1.590   50.00%       41.93%   1.586        43.33%
Computers&Internet   33           681           29.60%   1.487   39.39%       31.27%   1.522        36.36%
Education            33           550           33.50%   1.465   57.58%       33.73%   1.458        57.58%
Entertainment        21           640           29.30%   1.426   28.57%       28.20%   1.417        33.33%
Health               32           612           48.05%   1.667   53.13%       47.20%   1.659        53.13%
Recreation&Sports    22           606           30.20%   1.414   18.18%       31.20%   1.429        18.18%
Reference            33           793           13.75%   1.159   51.52%       14.60%   1.177        54.55%
Science              40           743           34.85%   1.489   35.00%       30.57%   1.425        40.00%
Social&Science       39           1,047         20.95%   1.274   56.41%       22.83%   1.290        58.97%
Society&Culture      27           636           41.90%   1.705   25.93%       39.97%   1.684        22.22%




6.2.2 Web Page Categorization 

The web page categorization task has been studied in [38,65,80]. The web pages
were collected from the "yahoo.com" domain and then divided into 11 data sets
based on Yahoo's top-level categories. After that, each page is classified into a
number of Yahoo's second-level subcategories. Each data set contains 2,000 training
documents and 3,000 test documents. The simple term selection method based on
document frequency (the number of documents containing a specific term) was
applied to each data set to reduce the dimensionality. Actually, only the 2% of words
with the highest document frequency were retained in the final vocabulary. Other
term selection methods such as information gain and mutual information can also
be adopted. After term selection, each document in the data set is described as a
feature vector using the "Bag-of-Words" representation, i.e., each feature expresses
the number of times a vocabulary word appears in the document.

Table 8 summarizes the characteristics of the web page data sets. Compared with
the Yeast data in Section 6.2.1, here the instances are represented by much
higher-dimensional feature vectors and a large portion of them (about 20% ~ 45%) are
multi-labeled. Moreover, here the number of categories (21 ~ 40) is much larger
and many of them are rare categories (about 20% ~ 55%). So, the web page data
sets are more difficult to learn than the Yeast data.

Data set available at 'http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar.gz'.

Yang and Pedersen [73] have shown that based on document frequency, it is possible
to reduce the dimensionality by a factor of 10 with no loss in effectiveness and by a factor
of 100 with just a small loss.

Fig. 12. Results on the eleven Yahoo data sets.



In the experiments the parameter settings are similar to those in Section 6.2.1.
Results on the eleven data sets are shown in Fig. 12, and the average results are
summarized in Table 9 where the best performance on each criterion has been
highlighted in boldface.



Table 9

Results (mean±std.) on eleven web page categorization data sets ('↓' indicates 'the
smaller the better'; '↑' indicates 'the larger the better')

Compared       Evaluation Criteria
Algorithms     hloss ↓     one-error ↓   coverage ↓     rloss ↓     aveprec ↑   avgrecl ↑   aveF1 ↑
InsDif         .039±.013   .381±.118     4.545±1.285    .102±.037   .686±.091   .377±.163   .479±.154
AdtBoost.MH    .043±.013   .461±.137     4.083±1.191    N/A         .632±.105   N/A         N/A
RankSvm        .043±.014   .440±.143     7.508±2.396    .193±.065   .605±.117   .243±.175   .333±.179
MlSvm          .042±.015   .375±.119     6.919±1.767    .168±.047   .660±.093   .378±.167   .472±.156
Ml-knn         .043±.014   .471±.157     4.097±1.236    .102±.045   .625±.116   .292±.189   .381±.196
Cnmf           N/A         .509±.142     6.717±1.588    .171±.058   .561±.114   N/A         N/A




Table 9 shows that InsDif performs quite well on almost all evaluation criteria.
Pairwise t-tests with 95% significance level disclose that InsDif is significantly
better than all the compared algorithms on hamming loss, average precision and
average F1; on ranking loss it is comparable to Ml-knn, significantly better than
all other algorithms; on average recall it is comparable to MlSvm, significantly
better than all other algorithms; on one-error it is worse than MlSvm, but
significantly better than all the other three algorithms; on coverage it is worse than
AdtBoost.MH and Ml-knn, but significantly better than the other three
algorithms.

Overall, results on the web page categorization task suggest that MIML can be 
useful when we are given observational data where each ambiguous object has 
already been represented by a single instance. 

7 Solving Multi-Instance Single-Label Problems through MIML Transformation

In many tasks we are given observational data where each object has already been
represented as a multi-instance single-label example, and we do not have access to
the raw objects. In such a case, we cannot capture more information from the raw
objects using the MIML representation. Even in this situation, however, MIML
is still useful. Here we propose the SubCod (i.e., SUB-COncept Discovery)
algorithm which transforms multi-instance single-label examples into MIML examples
to exploit the power of MIML.



7.1 SubCod



For an object that has been described by multiple instances, if it is associated with a
label corresponding to a high-level complicated concept such as Africa in Fig. 3(a),
it may be quite difficult to learn this concept directly since many different
lower-level concepts are mixed. If we can transform the single label into a set of labels
corresponding to some sub-concepts, which are relatively clearer and easier to learn,
we can learn these labels at first and then derive the high-level complicated label
based on them, as illustrated in Fig. 3(b). This is the motivation of SubCod.

SubCod is a two-stage algorithm, which is based on sub-concept discovery. In
the first stage, SubCod transforms each single-label example into a multi-label
example by discovering and exploiting sub-concepts involved in the original label;
this is realized by constructing multiple labels through unsupervised clustering of all
instances and then treating each cluster as a set of instances of a separate
sub-concept. In the second stage, the outputs learned from the transformed data set
are used to derive the original labels that are to be predicted; this is realized
by using a supervised learning algorithm to predict the original labels from the
sub-concepts predicted by an MIML learner.

We use the same notation as in Sections 3 and 4, that is, we are given a data set
$\{(X_1, y_1), (X_2, y_2), \cdots, (X_m, y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \cdots,
x_{i,n_i}\}$, $x_{ij} \in \mathcal{X}$ $(j = 1, 2, \cdots, n_i)$, and $y_i \in \mathcal{Y}$ is the label of $X_i$. Here $n_i$ denotes
the number of instances in $X_i$.

In the first stage, SubCod collects all instances from all the bags to compose a
data set $D = \{x_{11}, \cdots, x_{1,n_1}, x_{21}, \cdots, x_{2,n_2}, \cdots, x_{m1}, \cdots, x_{m,n_m}\}$. For the ease of
discussion, let $N = \sum_{i=1}^{m} n_i$ and re-index the instances in $D$ as $\{x_1, x_2, \cdots, x_N\}$.
A Gaussian mixture model with $M$ mixture components is to be learned from $D$
by the EM algorithm, and the mixture components are regarded as sub-concepts.
The parameters of the mixture components, i.e., the means $\mu_k$, covariances $\Sigma_k$
and mixing coefficients $\pi_k$ $(k = 1, 2, \cdots, M)$, are randomly initialized and the
initial value of the log-likelihood is evaluated. In the E-step, the responsibilities
are measured according to

$$\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{M} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \quad (i = 1, 2, \cdots, N) . \qquad (29)$$

In the M-step, the parameters are re-estimated according to

$$\mu_k^{new} = \frac{\sum_{i=1}^{N} \gamma_{ik} x_i}{\sum_{i=1}^{N} \gamma_{ik}} , \qquad (30)$$

$$\Sigma_k^{new} = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k^{new})(x_i - \mu_k^{new})^\top}{\sum_{i=1}^{N} \gamma_{ik}} , \qquad (31)$$

$$\pi_k^{new} = \frac{\sum_{i=1}^{N} \gamma_{ik}}{N} , \qquad (32)$$

and the log-likelihood is evaluated according to

$$\ln p(D \mid \mu, \Sigma, \pi) = \sum_{i=1}^{N} \ln \left( \sum_{k=1}^{M} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right) . \qquad (33)$$

After the convergence of the EM process (or after a pre-specified number of
iterations), we can estimate the associated sub-concept for every instance $x_i \in D$
$(i = 1, 2, \cdots, N)$ by

$$sc(x_i) = \arg\max_{k} \gamma_{ik} \quad (k = 1, 2, \cdots, M) . \qquad (34)$$

Then, we can derive the multi-label for each $X_i$ $(i = 1, 2, \cdots, m)$ by considering the
sub-concept belongingness. Let $c_i$ denote an $M$-dimensional binary vector where
each element is either +1 or -1. For $j = 1, 2, \cdots, M$, $c_{ij} = +1$ means that the
sub-concept corresponding to the $j$-th Gaussian mixture component appears in
$X_i$, while $c_{ij} = -1$ means that this sub-concept does not appear in $X_i$. Here the
value of $c_{ij}$ can be determined according to a simple rule that $c_{ij} = +1$ if $X_i$ has
at least one instance which takes the $j$-th sub-concept (i.e., satisfying Eq. 34);
otherwise $c_{ij} = -1$. Note that for examples with an identical single label, the derived
multi-labels for them may be different.
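The sub-concept discovery step can be sketched for the simplest possible case. The code below is an illustration under stated assumptions: instances are one-dimensional so that each covariance reduces to a scalar variance, the initial means are passed in explicitly instead of being drawn at random, and a fixed number of iterations replaces the log-likelihood convergence check. Function names are hypothetical.

```python
import math


def gauss(x, mu, var):
    """Density of a one-dimensional Gaussian N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, init_means, n_iter=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture (Eqs. 29-32 with scalar variances)."""
    M, N = len(init_means), len(xs)
    mus, variances, pis = list(init_means), [1.0] * M, [1.0 / M] * M
    for _ in range(n_iter):
        # E-step (Eq. 29): responsibilities of each component for each instance.
        gam = []
        for x in xs:
            w = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(M)]
            total = sum(w)
            gam.append([wk / total for wk in w])
        # M-step (Eqs. 30-32): re-estimate means, variances, mixing coefficients.
        for k in range(M):
            nk = sum(g[k] for g in gam)
            mus[k] = sum(g[k] * x for g, x in zip(gam, xs)) / nk
            variances[k] = max(var_floor,
                               sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gam, xs)) / nk)
            pis[k] = nk / N
    return mus, variances, pis


def bag_sub_concepts(bags, mus, variances, pis):
    """Eq. 34 plus the rule c_ij = +1 iff some instance of the bag takes sub-concept j."""
    M = len(mus)
    labels = []
    for bag in bags:
        c = [-1] * M
        for x in bag:
            # argmax_k gamma_ik; the shared denominator of Eq. 29 can be dropped.
            k = max(range(M), key=lambda j: pis[j] * gauss(x, mus[j], variances[j]))
            c[k] = +1
        labels.append(c)
    return labels


bags = [[0.0, 0.2], [-0.1, 5.0], [5.1, 4.9]]
xs = [x for bag in bags for x in bag]
mus, variances, pis = em_gmm_1d(xs, init_means=[0.0, 5.0])
c = bag_sub_concepts(bags, mus, variances, pis)
```

The second bag receives both sub-concepts because it contains one instance near each component, which is how a single-label bag ends up with a multi-label vector.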



The above process works in an unsupervised way which does not consider the
original labels of the bags $X_i$'s. Thus, the derived multi-labels $c_i$ need to be polished
by incorporating the relation between the sub-concepts and the original label of
$X_i$. Here the maximum margin criterion is used. In detail, consider a vector $z_i$ with
elements $z_{ij} \in [-1.0, +1.0]$ $(j = 1, 2, \cdots, M)$; $z_{ij} = +1$ means that the label $c_{ij}$
should not be modified while $z_{ij} = -1$ means that the label $c_{ij}$ should be inverted.
Denote $q_i = c_i \odot z_i$ as that for $j = 1, 2, \cdots, M$, $q_{ij} = c_{ij} z_{ij}$. Let $\theta$ denote the
smallest number of labels that cannot be inverted. SubCod attempts to optimize
the objective



mm -\\w\\l + Cj2^i (35) 

w,b,$,Z 2 ~[ 



s.t. Viiw'^ici QZi) + b)>l- Ci, 
Y,Zij> 29-1, 



where $Z = [z_1, z_2, \cdots, z_m]$.

By solving Eq. 35 we will get the vector $z_i$ which maximizes the margin of the
prediction of the proper labels of $X_i$. Here we solve Eq. 35 iteratively. We initialize
$Z$ with all 1's. First, we fix $Z$ to get the optimal $w$ and $b$; this is a standard QP
problem. Then, we fix $w$ and $b$ to get the optimal $Z$; this is a standard LP problem.
These two steps are iterated till convergence. Finally, we set the multi-label vector's
elements which correspond to positive $c_{ij} z_{ij}$'s $(i = 1, 2, \cdots, m;\ j = 1, 2, \cdots, M)$ to
+1, and set the remaining ones to -1. Thus, we get all the polished multi-label
vectors $c_i$ for the bags $X_i$. In this way, the original data set $\{(X_1, y_1), (X_2, y_2), \cdots, (X_m, y_m)\}$
is transformed to a MIML data set $\{(X_1, c_1), (X_2, c_2), \cdots, (X_m, c_m)\}$, and any
MIML algorithm can be applied.

To map the multi-labels predicted by the MIML classifier for a test example to the
original single-labels $y \in \mathcal{Y}$, in the second stage of SubCod, a traditional classifier
$f : \{+1, -1\}^M \to \mathcal{Y}$ is generated from the data set $\{(c_1, y_1), (c_2, y_2), \cdots, (c_m, y_m)\}$.
This is relatively simple and traditional supervised learning algorithms can be
applied.

The pseudo-code of SubCod is summarized in Table 10. In the first stage (Steps 



Table 10

The SubCod algorithm

1  For multi-instance single-label examples $(X_u, y_u)$ $(u = 1, 2, \cdots, m)$, collect all the
   instances $x \in X_u$ together and identify the Gaussian mixture components through
   the EM process detailed in Eqs. 29 to 33.
2  Determine the sub-concept for every instance $x \in X_u$ according to Eq. 34, and
   then derive the label vector $c_u$ for $X_u$.
3  Make corrections to $c_u$ by optimizing Eq. 35, which results in the polished label vector for $X_u$, and then
   train a MIML learner $h_t(X)$ on $\{(X_u, c_u)\}$ $(u = 1, 2, \cdots, m)$.
4  Train a classifier $h_y(c)$ on $\{(c_u, y_u)\}$ $(u = 1, 2, \cdots, m)$, which maps the derived
   multi-labels to the original single-labels.
5  Return $y^* = h_y(h_t(X^*))$.



1 to 3), SubCod derives multi-labels via sub-concept discovery and transforms 
single-label examples into a MIML examples, from which a MIML learner is gen- 
erated. In the second stage (Steps 4), a traditional classifier is trained to map 
the derived multi-labels to the original single-labels. Test example X* is fed to 
the MIML learner to get its multi-labels, and the multi-labels are then fed to the 
supervised classifier to get the label y* predicted for X* . 
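
The two-stage prediction y* = h_y(h_t(X*)) can be sketched as a simple composition. The code below is only an illustration under stated assumptions: `toy_h_t` is a hypothetical stand-in for a trained MIML learner, and the second-stage classifier is realized by a 1-nearest-neighbour rule under Hamming distance, which is a choice made for this sketch and not the classifier used in the experiments.

```python
def hamming(a, b):
    """Number of positions at which two label vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def train_label_mapper(label_vectors, labels):
    """Second stage: build h_y as a 1-nearest-neighbour rule (an assumption)."""
    pairs = list(zip(label_vectors, labels))

    def h_y(c):
        return min(pairs, key=lambda pair: hamming(pair[0], c))[1]

    return h_y


def subcod_predict(h_t, h_y, bag):
    """Compose the two stages: y* = h_y(h_t(X*))."""
    return h_y(h_t(bag))


def toy_h_t(bag):
    """Hypothetical stand-in for the MIML learner: sign of each feature mean."""
    dims = len(bag[0])
    means = [sum(x[d] for x in bag) / len(bag) for d in range(dims)]
    return tuple(1 if mu > 0 else -1 for mu in means)


# Hypothetical derived multi-labels and original single-labels of two bags.
train_c = [(1, 1, -1), (-1, -1, 1)]
train_y = [+1, -1]
h_y = train_label_mapper(train_c, train_y)

test_bag = [[0.5, 0.2, -1.0], [1.5, -0.1, -0.3]]
y_star = subcod_predict(toy_h_t, h_y, test_bag)
```

Keeping the two stages as separate callables mirrors Steps 3 and 4 of the algorithm, so either stage can be replaced without touching the other.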

7.2 Experiments 

We compare SubCod with several state-of-the-art multi-instance learning algo-
rithms, including Diverse Density [44], Em-dd [83], MI-SVM and mi-SVM [3],
and Ch-Fd [30]; these algorithms have been introduced briefly in Section 2.

Note that the experiments here are very different from those in Sections 4.3, 5.6 and
6.2. Sections 4.3 and 5.6 both deal with learning from MIML examples, and Section 6.2
deals with learning from single-instance multi-label examples, whereas this section
deals with learning from multi-instance single-label examples. Therefore, the
experimental data sets in this section are different from those used in Sections 4.3,
5.6 and 6.2.



Table 11

Predictive accuracy on five multi-instance benchmark data sets

Compared                              Data sets
algorithms         Musk1     Musk2     Elephant   Tiger     Fox

SubCod             85.0%     92.1%     83.6%      80.8%     61.6%
Diverse Density    88.0%     84.0%     N/A        N/A       N/A
Em-dd              84.8%     84.9%     78.3%      72.1%     56.1%
mi-SVM             87.4%     83.6%     82.0%      78.9%     58.2%
MI-SVM             77.9%     84.3%     81.4%      84.0%     59.4%
Ch-Fd              88.8%     85.7%     82.4%      82.2%     60.4%

Five benchmark multi-instance learning data sets are used, including Musk1, Musk2,
Elephant, Tiger and Fox. Both Musk1 and Musk2 are drug activity prediction data
sets, publicly available at the UCI machine learning repository [8]. Here every bag
corresponds to a molecule, while every instance corresponds to a low-energy shape
of the molecule [24]. Musk1 contains 47 positive bags and 45 negative bags, and the
number of instances contained in each bag ranges from 2 to 40. Musk2 contains
39 positive bags and 63 negative bags, and the number of instances contained
in each bag ranges from 1 to 1,044. Each instance is a 166-dimensional feature
vector. Elephant, Tiger and Fox are three image annotation data sets generated
by [3] for multi-instance learning. Here every bag is an image, while every instance
corresponds to a segmented region in the image [3]. Each data set contains 100 pos-
itive and 100 negative bags, and each instance is a 230-dimensional feature vector.
These data sets are popularly used in evaluating the performance of multi-instance
learning algorithms.

We follow the benchmark experiment design and report the average accuracy of 10 runs
of ten-fold cross validation. Here the parameters of SubCod are determined by
hold-out tests on training sets, and the candidate value ranges are [10, 70] for M, the
number of Gaussian mixture components, and [10% × #labels, 70% × #labels] for
θ, the smallest number of derived labels that cannot be inverted. MimlSvm is used
to realize the MIML learner in Step 3 of SubCod, and the candidate value range is
[200, 400] for parameter k, the number of intermediate clusters in MimlSvm. The
classifier h_y in Step 4 of SubCod is realized by Smo with default parameters. The
results of the compared algorithms are the best performance reported in the literature
[3,30]. The comparison is summarized in Table 11, where the best performance on
each data set has been highlighted in boldface.
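
The evaluation protocol (10 runs of ten-fold cross validation) can be sketched as follows. This is a minimal illustration rather than the original experimental code: the names `k_fold_indices`, `repeated_cv_accuracy` and `train_fn` are hypothetical, the accuracy of each run is assumed to be pooled over its ten folds, and a placeholder learner on toy data stands in for SubCod.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def repeated_cv_accuracy(bags, labels, train_fn, runs=10, k=10, seed=0):
    """Average accuracy over `runs` independent k-fold cross validations."""
    rng = random.Random(seed)
    n = len(bags)
    run_accuracies = []
    for _ in range(runs):
        correct = 0
        for test_idx in k_fold_indices(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            model = train_fn([bags[i] for i in train_idx],
                             [labels[i] for i in train_idx])
            correct += sum(model(bags[i]) == labels[i] for i in test_idx)
        # Assumption: the accuracy of one run is pooled over its k folds.
        run_accuracies.append(correct / n)
    return sum(run_accuracies) / runs


# Hypothetical toy data: 92 one-instance bags (47 positive, 45 negative).
toy_bags = [[[float(v)]] for v in range(-45, 48) if v != 0]
toy_labels = [1 if bag[0][0] > 0 else -1 for bag in toy_bags]


def toy_train(train_bags, train_labels):
    """Placeholder learner that ignores its training data (illustration only)."""
    return lambda bag: 1 if bag[0][0] > 0 else -1


accuracy = repeated_cv_accuracy(toy_bags, toy_labels, toy_train)
```

Any real learner can be plugged in through `train_fn`, as long as it returns a callable that maps a bag to a predicted label.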

Table 11 shows that SubCod is very competitive with state-of-the-art multi-instance
learning algorithms. In particular, on Musk2 its performance is much better than
that of the other algorithms. This is to be expected because Musk2 is a complicated
data set which has the largest number of instances, and on such a data set the
sub-concept discovery process of SubCod may be more effective.
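
The comparison can be made concrete with a small calculation over the accuracies reported in Table 11. The sketch below only re-reads the table: it finds the best accuracy per data set and the gap between SubCod and its strongest competitor. The helper names are hypothetical, and the assignment of the two SVM rows follows our reading of the table.

```python
DATA_SETS = ["Musk1", "Musk2", "Elephant", "Tiger", "Fox"]

# Accuracies (%) copied from Table 11; None marks an N/A entry.
TABLE_11 = {
    "SubCod":          [85.0, 92.1, 83.6, 80.8, 61.6],
    "Diverse Density": [88.0, 84.0, None, None, None],
    "Em-dd":           [84.8, 84.9, 78.3, 72.1, 56.1],
    "mi-SVM":          [87.4, 83.6, 82.0, 78.9, 58.2],
    "MI-SVM":          [77.9, 84.3, 81.4, 84.0, 59.4],
    "Ch-Fd":           [88.8, 85.7, 82.4, 82.2, 60.4],
}


def best_per_data_set(table):
    """Return {data set: (best accuracy, algorithm)} while skipping N/A entries."""
    best = {}
    for col, name in enumerate(DATA_SETS):
        scored = [(row[col], alg) for alg, row in table.items()
                  if row[col] is not None]
        best[name] = max(scored)
    return best


def subcod_margin(table):
    """Return SubCod's accuracy minus the best competing accuracy, per data set."""
    margins = {}
    for col, name in enumerate(DATA_SETS):
        rivals = [row[col] for alg, row in table.items()
                  if alg != "SubCod" and row[col] is not None]
        margins[name] = round(table["SubCod"][col] - max(rivals), 1)
    return margins


best = best_per_data_set(TABLE_11)
margins = subcod_margin(TABLE_11)
```

On these numbers SubCod has the highest accuracy on Musk2, Elephant and Fox, and its lead on Musk2 is 6.4 percentage points over the next best algorithm.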

Overall, the experimental results suggest that MIML can be useful when we are
given observational data where each object has already been represented as a multi-
instance single-label example.



8 Conclusion 

This paper extends our preliminary work [81, 92] to formalize the MIML (Multi-
Instance Multi-Label learning) framework for learning with ambiguous objects, where
an example is described by multiple instances and associated with multiple class
labels. It was inspired by the recognition that when solving real-world problems,
having a good representation is often more important than having a strong learning
algorithm because a good representation may capture more meaningful information
and make the learning task easier to tackle. Since many real objects inherently
carry input ambiguity as well as output ambiguity, MIML is more natural and
convenient for tasks involving such objects.

To exploit the advantages of the MIML representation, we propose the Miml-
Boost algorithm and the MimlSvm algorithm based on a simple degeneration
strategy. Experiments on scene classification and text categorization show that
solving problems involving ambiguous objects under the MIML framework can lead
to good performance. Considering that the degeneration process may lose informa-
tion, we also propose the D-MimlSvm algorithm which tackles MIML problems
directly in a regularization framework. Experiments show that this "direct" SVM
algorithm outperforms the "indirect" MimlSvm algorithm.



In some practical tasks we are given observational data where each ambiguous
object has already been represented by a single instance, and we do not have
access to the raw objects, so we cannot capture more information from the
raw objects using the MIML representation. For such a scenario, we propose the
InsDif algorithm which transforms single-instances into MIML examples to learn.
Experiments on Yeast gene functional analysis and web page categorization show
that this algorithm is able to achieve better performance than learning the single-
instances directly. This is not difficult to understand. The underlying task
of (single-instance) multi-label learning is to learn a one-to-many mapping, yet
one-to-many mappings are not mathematical functions. If we represent the multi-
label object using multi-instances, the underlying task becomes learning a many-
to-many mapping, which is realizable by mathematical functions. So, transforming
multi-label examples to MIML examples for learning may be beneficial in some
tasks.

MIML can also be helpful for learning single-label examples involving complicated
high-level concepts. It is usually quite difficult to learn such concepts directly
since many different lower-level concepts are mixed together. If we can transform
the single label into a set of labels corresponding to some sub-concepts, which are
relatively clearer and easier to learn, we can learn these labels first and then
derive the complicated high-level label based on them. Inspired by this recognition,
we propose the SubCod algorithm which works by first discovering sub-concepts of
the target concept and then transforming the data into MIML examples to
learn. Experiments show that this algorithm is able to achieve better performance
than learning the single-label examples directly in some tasks.

We believe that semantics exist in the connections between atomic input patterns
and atomic output patterns, and a prominent usefulness of MIML, which has
not been realized in this paper, is the possibility of identifying such connections.
As illustrated in Fig. 2(b), in the MIML framework it is possible to understand
why a concerned object has a certain class label; this may be more important
than simply making an accurate prediction, because the results could be helpful
for understanding the source of ambiguous semantics.